1 Introduction

Kaggle describe this competition as follows:

People interested in renting an apartment or home, share information about themselves and their property on Airbnb. Those who end up renting the property share their experiences through reviews. The dataset describes property, host, and reviews for over 40,000 Airbnb rentals in New York along 90 variables.*

2 Loading Data

2.1 Loading libraries and read two dataset

Loading R packages used besides base R.

library(knitr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(corrplot)
## corrplot 0.90 loaded
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(rapport)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(cattonum)
## cattonum is seeking a new maintainer; please respond if interested: https://github.com/bfgray3/cattonum/issues/40
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(skimr)
library(vtreat)
## Loading required package: wrapr
## 
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
## 
##     coalesce
## The following objects are masked from 'package:tidyr':
## 
##     pack, unpack
## The following object is masked from 'package:tibble':
## 
##     view

Reading two csv files as dataframes into R.

data = read.csv("/Users/chensixian/Desktop/CU AA/5200/Kaggle/rentlala2021/analysisData.csv")
ScoringData = read.csv("/Users/chensixian/Desktop/CU AA/5200/Kaggle/rentlala2021/ScoringData.csv")

2.2 Inspect Data and drop some columns

All variables should be carefully examined, understand what they means could help a lot in data cleaning and feature engineering.

str(data)
## 'data.frame':    41330 obs. of  91 variables:
##  $ id                                          : int  914965 262763 279053 385319 476558 522365 803935 373951 729674 372028 ...
##  $ name                                        : chr  "<BRAND NEW BUILDING! 2 BR 1. 5 BA in Trendy Bushwick>" "<Brooklyn Brownstone parlor living with full A/C>" "<Posh one bedroom apartment in Brooklyn>" "<COMFY ROOM in Chelsea for long term>" ...
##  $ summary                                     : chr  "Trendy,  artsy buswick.   Go where the NYC artists are!  Cafe's,  bars,  and the subway to downtown is a 4 minu"| __truncated__ "Full parlor floor in a historic Brooklyn Brownstone in Prospect Heights,  we are within blocks of Prospect Park"| __truncated__ "Brand new super clean and glamorous one bedroom garden apartment with central air/heat in the heart of trendy P"| __truncated__ "*Beautiful room available for long term (prefer >30days)*  - Perfect for 1 person,  possible but not recommende"| __truncated__ ...
##  $ space                                       : chr  "There is a full bath and master suite bedroom with Queen bed on the first floor.  Living room with dining area "| __truncated__ "" "The apartment has its own entry/exit,  is impeccably clean and is equipped with all basic necessities including"| __truncated__ "The kitchen is spacious and have enough equipments for you to cook any basic dishes!  There’s a dining table wi"| __truncated__ ...
##  $ description                                 : chr  "Trendy,  artsy buswick.   Go where the NYC artists are!  Cafe's,  bars,  and the subway to downtown is a 4 minu"| __truncated__ "Full parlor floor in a historic Brooklyn Brownstone in Prospect Heights,  we are within blocks of Prospect Park"| __truncated__ "Brand new super clean and glamorous one bedroom garden apartment with central air/heat in the heart of trendy P"| __truncated__ "*Beautiful room available for long term (prefer >30days)*  - Perfect for 1 person,  possible but not recommende"| __truncated__ ...
##  $ neighborhood_overview                       : chr  "" "" "Short and easy walk to: - Grand army plaza - Prospect Park - Brooklyn botanic gardens - Brooklyn museum - Atlan"| __truncated__ "There are many very good restaurants in the neighborhood.  Chipotle,  Pizza and very good ramen and sushi are j"| __truncated__ ...
##  $ notes                                       : chr  "" "" "While the household and building is rather academic and we socialize a fair bit,  we stay away from loud partie"| __truncated__ "I have a secret back yard so if you’re an awesome guest I will invite you out there for tea or wine! :)" ...
##  $ transit                                     : chr  "The J/Z subway is a 4 minute walk and the L train is a 7 minute walk.   25 minutes into downtown Manhattan!" "" "Several subway stops are within a short walk; the closest line is 2 blocks away. " "Parking is available for free right in the front after 7pm,  and 1 hour meter is from 7:30am to 7 pm.   Across "| __truncated__ ...
##  $ access                                      : chr  "" "There will be a key lockbox on the front gate.  Walk up the brownstown front stairs and open the door by turnin"| __truncated__ "Access to a full size washer and dryer (detergent provided)" "Bathroom and living room" ...
##  $ interaction                                 : chr  "I live locally,  so I am available. " "You will have access and use of 2 bedrooms which includes a queen sized bed,  air mattress,  and extra-long couch. " "We are a low key couple and live in the building and available as much or as little as needed. " "I’m available anytime if you have any questions! I would love to get to know you and hear your stories but if y"| __truncated__ ...
##  $ house_rules                                 : chr  "" "No parties! We expect any guest to be added to the reservation in advance.  Please respect this home as your ow"| __truncated__ "Smoking and vaping (e-cigarettes) anywhere on the premises is STRICTLY PROHIBITED;  Overnight visitors NOT allo"| __truncated__ "Please always clean up after yourself when you use the bathroom or kitchen,  take your own trash out and keep t"| __truncated__ ...
##  $ host_name                                   : chr  "Andrea" "Jeremy" "Foster" "Sophie" ...
##  $ host_since                                  : chr  "2019-06-13" "2011-06-08" "2013-01-20" "2014-08-18" ...
##  $ host_location                               : chr  "US" "New York, New York, United States" "New York, New York, United States" "Ossining, New York, United States" ...
##  $ host_about                                  : chr  "" "Jeremy and Kristy live in Brooklyn and East Hampton NY. Jeremy is in technology/finance and Kristy is in archit"| __truncated__ "" "" ...
##  $ host_response_time                          : chr  "within a few hours" "within an hour" "within a day" "within an hour" ...
##  $ host_response_rate                          : chr  "70%" "86%" "100%" "100%" ...
##  $ host_acceptance_rate                        : chr  "N/A" "25%" "66%" "98%" ...
##  $ host_is_superhost                           : chr  "f" "f" "t" "f" ...
##  $ host_neighbourhood                          : chr  "Bushwick" "" "Prospect Heights" "Greenwich Village" ...
##  $ host_listings_count                         : int  0 1 1 3 6 1 1 221 1 2 ...
##  $ host_total_listings_count                   : int  0 1 1 3 6 1 1 221 1 2 ...
##  $ host_verifications                          : chr  "['phone']" "['email', 'phone', 'facebook', 'reviews', 'jumio', 'government_id', 'work_email']" "['email', 'phone', 'reviews', 'jumio', 'government_id']" "['email', 'phone', 'reviews', 'jumio', 'government_id']" ...
##  $ host_has_profile_pic                        : chr  "t" "t" "t" "t" ...
##  $ host_identity_verified                      : chr  "f" "t" "f" "t" ...
##  $ street                                      : chr  "Brooklyn, NY, United States" "New York, NY, United States" "Brooklyn, NY, United States" "New York, NY, United States" ...
##  $ neighbourhood                               : chr  "Brooklyn" "Brooklyn" "Prospect Heights" "Manhattan" ...
##  $ neighbourhood_cleansed                      : chr  "Bushwick" "Prospect Heights" "Prospect Heights" "Greenwich Village" ...
##  $ neighbourhood_group_cleansed                : chr  "Brooklyn" "Brooklyn" "Brooklyn" "Manhattan" ...
##  $ city                                        : chr  "Brooklyn" "New York" "Brooklyn" "New York" ...
##  $ state                                       : chr  "NY" "NY" "NY" "NY" ...
##  $ zipcode                                     : chr  "11207" "11238" "11238" "10018" ...
##  $ market                                      : chr  "New York" "New York" "New York" "New York" ...
##  $ smart_location                              : chr  "Brooklyn, NY" "New York, NY" "Brooklyn, NY" "New York, NY" ...
##  $ country_code                                : chr  "US" "US" "US" "US" ...
##  $ country                                     : chr  "United States" "United States" "United States" "United States" ...
##  $ is_location_exact                           : chr  "t" "f" "t" "t" ...
##  $ property_type                               : chr  "Condominium" "Apartment" "Guest suite" "Apartment" ...
##  $ room_type                                   : chr  "Entire home/apt" "Entire home/apt" "Entire home/apt" "Private room" ...
##  $ accommodates                                : int  5 4 3 2 2 2 1 6 2 8 ...
##  $ bathrooms                                   : num  1.5 1 1 1 1.5 1 1 2 1 1.5 ...
##  $ bedrooms                                    : int  2 2 1 1 1 1 1 2 1 3 ...
##  $ beds                                        : int  2 1 1 1 1 1 1 3 1 4 ...
##  $ bed_type                                    : chr  "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
##  $ amenities                                   : chr  "TV,Wifi, Air conditioning ,Kitchen,Heating,Washer,Dryer, Smoke detector , Carbon monoxide detector , Fire extin"| __truncated__ "TV,Internet,Wifi,Kitchen, Pets live on this property ,Dog(s), Free street parking , Buzzer/wireless intercom ,H"| __truncated__ "TV,Wifi, Air conditioning ,Kitchen,Heating,Washer,Dryer, Smoke detector , Carbon monoxide detector , First aid "| __truncated__ "Wifi, Air conditioning ,Kitchen, Buzzer/wireless intercom ,Heating, Smoke detector , Carbon monoxide detector ,"| __truncated__ ...
##  $ square_feet                                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ price                                       : int  145 199 165 59 48 80 58 339 65 119 ...
##  $ weekly_price                                : int  NA NA NA NA NA NA NA NA 650 NA ...
##  $ monthly_price                               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ security_deposit                            : int  0 399 0 0 0 NA NA 0 0 100 ...
##  $ cleaning_fee                                : int  110 50 75 49 25 45 NA 99 50 95 ...
##  $ guests_included                             : int  1 2 2 1 1 1 1 1 1 7 ...
##  $ extra_people                                : int  0 50 50 25 15 0 0 0 10 25 ...
##  $ minimum_nights                              : int  6 2 2 30 2 2 3 3 5 3 ...
##  $ maximum_nights                              : int  29 1125 12 1125 1125 30 31 1125 32 3 ...
##  $ minimum_minimum_nights                      : int  6 2 1 30 2 2 3 3 5 3 ...
##  $ maximum_minimum_nights                      : int  6 2 3 30 2 2 3 3 5 3 ...
##  $ minimum_maximum_nights                      : int  29 1125 12 1125 1125 30 31 1125 32 3 ...
##  $ maximum_maximum_nights                      : int  29 1125 12 1125 1125 30 31 1125 32 3 ...
##  $ minimum_nights_avg_ntm                      : num  6 2 2 30 2 2 3 3 5 3 ...
##  $ maximum_nights_avg_ntm                      : num  29 1125 12 1125 1125 ...
##  $ calendar_updated                            : chr  "2 weeks ago" "3 weeks ago" "yesterday" "4 weeks ago" ...
##  $ has_availability                            : chr  "t" "t" "t" "t" ...
##  $ availability_30                             : int  0 8 17 11 0 5 0 0 0 22 ...
##  $ availability_60                             : int  0 38 28 11 0 35 0 0 0 33 ...
##  $ availability_90                             : int  0 68 48 11 12 65 0 0 0 50 ...
##  $ availability_365                            : int  18 158 112 35 73 65 0 48 66 130 ...
##  $ number_of_reviews                           : int  2 5 52 30 14 20 1 28 6 23 ...
##  $ number_of_reviews_ltm                       : int  2 4 45 29 14 20 0 1 0 10 ...
##  $ first_review                                : chr  "2019-07-15" "2017-04-11" "2019-01-21" "2018-05-19" ...
##  $ last_review                                 : chr  "2019-07-20" "2019-12-31" "2020-03-01" "2020-01-01" ...
##  $ review_scores_rating                        : int  100 92 100 94 99 100 100 99 96 93 ...
##  $ review_scores_accuracy                      : int  10 10 10 9 10 10 10 10 10 9 ...
##  $ review_scores_cleanliness                   : int  10 10 10 9 9 10 10 10 10 9 ...
##  $ review_scores_checkin                       : int  10 10 10 9 10 10 10 10 10 9 ...
##  $ review_scores_communication                 : int  10 10 10 10 10 10 10 10 10 9 ...
##  $ review_scores_location                      : int  10 10 10 10 10 10 10 10 10 9 ...
##  $ review_scores_value                         : int  10 8 10 9 10 10 10 9 10 9 ...
##  $ requires_license                            : chr  "f" "f" "f" "f" ...
##  $ license                                     : chr  "" "" "" "" ...
##  $ jurisdiction_names                          : chr  "" "" "" "" ...
##  $ instant_bookable                            : chr  "f" "f" "f" "f" ...
##  $ is_business_travel_ready                    : chr  "f" "f" "f" "f" ...
##  $ cancellation_policy                         : chr  "strict_14_with_grace_period" "strict_14_with_grace_period" "strict_14_with_grace_period" "strict_14_with_grace_period" ...
##  $ require_guest_profile_picture               : chr  "f" "f" "f" "f" ...
##  $ require_guest_phone_verification            : chr  "f" "f" "f" "f" ...
##  $ calculated_host_listings_count              : int  1 1 1 2 3 1 1 12 2 2 ...
##  $ calculated_host_listings_count_entire_homes : int  1 1 1 0 0 0 0 12 1 2 ...
##  $ calculated_host_listings_count_private_rooms: int  0 0 0 2 3 1 1 0 1 0 ...
##  $ calculated_host_listings_count_shared_rooms : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reviews_per_month                           : num  2 0.14 3.72 1.35 1.34 2.33 0.04 1.22 0.07 0.98 ...
dim(data)
## [1] 41330    91

We can see “Data” has 41330 observations and 91 features.

3 Exploratory Data Analysis

3.1 Explore airbnb’s Price

As the plot shows, the airbnb rent price are right skewed. This means only small proportion of houses are expensive.

## Target Variable Exploration
ggplot(data, aes(price)) + 
  geom_histogram(fill = "red",bins = 100) + 
  labs(title ="Price Distribution") + 
  scale_x_continuous(breaks= seq(0, 1050, by=50))

summary(data$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      66     100     137     170     999

3.2 Correlation with Price

I want to see the correlation within all numeric variables

numericVars <- which(sapply(data, is.numeric)) #index vector numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on
cat('There are', length(numericVars), 'numeric variables')
## There are 41 numeric variables
all_numVar <- data[, numericVars]
cor_numVar <- cor(all_numVar, use="pairwise.complete.obs") #correlations of all numeric variables

#sort on decreasing correlations with price
cor_sorted <- as.matrix(sort(cor_numVar[,'price'], decreasing = TRUE))
 #select corelations > 0.2
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.2)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]

corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")

Accommodates has high correlation with price. It also becomes clear the multicollinearity is an issue. For example: the correlation accommodates,bedrooms, beds is very high, we might consider drop some of these variables.

4 Combine training set with test set.

train <- data
test <- ScoringData

Combine Training set, validation set, and test set.

train_labels <- train$id
test_labels <- test$id
#test$id <- NULL
#train$id <- NULL
#validation$id <- NULL
test$price <- NA
all <- rbind(train,test)
dim(all)
## [1] 51663    91

Create test set with NA price.

4.1 drop some variables

Using function Hmisc::describe(), we can examine our variables.

# Hmisc::describe(all)

I comment the code because the output contains too much information.

We can see from the output, some features contain too many missing values and some features are text. Due to limit time and effort, I will exclude them from our data.

colnames = c("name","summary","space","description","neighborhood_overview",
             "notes","host_location",
             "transit","access","interaction","house_rules","host_name",
             "host_about","street",
             "city","state","market","smart_location",
             "country_code","country","license","jurisdiction_names")
all[,colnames] <-  list(NULL)

5 More EDA, Missing data, label encoding

5.1 Checking missing value

How many numeric variables have missing value? Let’s take a closer look!

Index of variables with missing values.

NAcol <- which(colSums(is.na(all)) > 0)
NAcol 
##       host_listings_count host_total_listings_count                      beds 
##                         8                         9                        23 
##               square_feet                     price              weekly_price 
##                        26                        27                        28 
##             monthly_price          security_deposit              cleaning_fee 
##                        29                        30                        31 
##         reviews_per_month 
##                        69

We can see from below, variables like “monthly_price”,“square_feet”,“weekly_price”,“security_deposit” all have large amount of missing value.

sort(colSums(sapply(all[NAcol], is.na)), decreasing = TRUE)
##             monthly_price               square_feet              weekly_price 
##                     51418                     51235                     47635 
##          security_deposit                     price              cleaning_fee 
##                     16722                     10333                      7474 
##                      beds       host_listings_count host_total_listings_count 
##                       153                        10                        10 
##         reviews_per_month 
##                         2

There are 10 numeric variables have missing values.

5.2 Tackle Missing Values

In this section, I am going to fix the 10 variables that contains missing values.

Besides making sure that the missing values are taken care off.I would convert categorical data into factors, but for variables like zipcode or neighbourhood_cleansed, I would use another approach to encode them.

5.2.1 Impute all TRUE, FALSE (Logical variables)

We select all variables that should be logical from dataset.

cl <- c("host_is_superhost","host_has_profile_pic","host_identity_verified",
        "is_location_exact","has_availability","requires_license","instant_bookable",
        "is_business_travel_ready","require_guest_profile_picture","require_guest_phone_verification")

Inspect variables.

Hmisc::describe(all[,cl])
## all[, cl] 
## 
##  10  Variables      51663  Observations
## --------------------------------------------------------------------------------
## host_is_superhost 
##        n  missing distinct 
##    51653       10        2 
##                       
## Value          f     t
## Frequency  40335 11318
## Proportion 0.781 0.219
## --------------------------------------------------------------------------------
## host_has_profile_pic 
##        n  missing distinct 
##    51653       10        2 
##                       
## Value          f     t
## Frequency    130 51523
## Proportion 0.003 0.997
## --------------------------------------------------------------------------------
## host_identity_verified 
##        n  missing distinct 
##    51653       10        2 
##                       
## Value          f     t
## Frequency  27837 23816
## Proportion 0.539 0.461
## --------------------------------------------------------------------------------
## is_location_exact 
##        n  missing distinct 
##    51663        0        2 
##                       
## Value          f     t
## Frequency   8928 42735
## Proportion 0.173 0.827
## --------------------------------------------------------------------------------
## has_availability 
##        n  missing distinct    value 
##    51663        0        1        t 
##                 
## Value          t
## Frequency  51663
## Proportion     1
## --------------------------------------------------------------------------------
## requires_license 
##        n  missing distinct    value 
##    51663        0        1        f 
##                 
## Value          f
## Frequency  51663
## Proportion     1
## --------------------------------------------------------------------------------
## instant_bookable 
##        n  missing distinct 
##    51663        0        2 
##                       
## Value          f     t
## Frequency  31160 20503
## Proportion 0.603 0.397
## --------------------------------------------------------------------------------
## is_business_travel_ready 
##        n  missing distinct    value 
##    51663        0        1        f 
##                 
## Value          f
## Frequency  51663
## Proportion     1
## --------------------------------------------------------------------------------
## require_guest_profile_picture 
##        n  missing distinct 
##    51663        0        2 
##                       
## Value          f     t
## Frequency  50470  1193
## Proportion 0.977 0.023
## --------------------------------------------------------------------------------
## require_guest_phone_verification 
##        n  missing distinct 
##    51663        0        2 
##                       
## Value          f     t
## Frequency  50358  1305
## Proportion 0.975 0.025
## --------------------------------------------------------------------------------

From the table we can see that some variables(has_availability,requires_license,is_business_travel_ready) have only 1 level. These variables do not help fitting the model, we should drop them.

Variables, host_is_superhost and host_identity_verified have empty value. We can impute them with majority class.

all %>%
  group_by(host_response_rate) %>%
  count()
## # A tibble: 92 x 2
## # Groups:   host_response_rate [92]
##    host_response_rate     n
##    <chr>              <int>
##  1 ""                    10
##  2 "0%"                 569
##  3 "10%"                 17
##  4 "100%"             26023
##  5 "11%"                  5
##  6 "13%"                  5
##  7 "14%"                  4
##  8 "17%"                 10
##  9 "18%"                  5
## 10 "20%"                 45
## # … with 82 more rows
cl <- c("host_is_superhost","host_identity_verified","is_location_exact","instant_bookable")

all <- all %>%
  select(!c("has_availability","requires_license","is_business_travel_ready","host_has_profile_pic")) %>%
  mutate(host_is_superhost = if_else(host_is_superhost=="","f",host_is_superhost)) %>%
  mutate(host_identity_verified = if_else(host_identity_verified=="","f",host_identity_verified)) %>%
  mutate(host_is_superhost = if_else(host_is_superhost=="t","1","0")) %>% # t <- 1, f <- 0
  mutate(host_identity_verified = if_else(host_identity_verified=="t","1","0")) %>%
  mutate(is_location_exact = if_else(is_location_exact=="t","1","0")) %>%
  mutate(instant_bookable= if_else(instant_bookable=="t","1","0")) %>%# t <- 1, f <- 0
  mutate_at(cl,as.numeric) #Convert string to numeric 

5.2.2 host_response_rate

The host_response_rate should be numerical, but we have 14644 NA, We can just drop this column.

Hmisc::describe(all$host_response_rate)
## all$host_response_rate 
##        n  missing distinct 
##    51653       10       91 
## 
## lowest : 0%   10%  100% 11%  13% , highest: 96%  97%  98%  99%  N/A

5.2.3 host_response_time

Hmisc::describe(all$host_response_time)
## all$host_response_time 
##        n  missing distinct 
##    51653       10        5 
## 
## lowest : a few days or more N/A                within a day       within a few hours within an hour    
## highest: a few days or more N/A                within a day       within a few hours within an hour    
##                                                                    
## Value      a few days or more                N/A       within a day
## Frequency                1209              14695               4770
## Proportion              0.023              0.284              0.092
##                                                 
## Value      within a few hours     within an hour
## Frequency                7028              23951
## Proportion              0.136              0.464
ggplot(train,aes(host_response_time,price))+
  geom_boxplot()

all%>%
  dplyr::group_by(host_response_time) %>%
  dplyr::count() %>%
  dplyr::arrange(desc(n))
## # A tibble: 6 x 2
## # Groups:   host_response_time [6]
##   host_response_time       n
##   <chr>                <int>
## 1 "within an hour"     23951
## 2 "N/A"                14695
## 3 "within a few hours"  7028
## 4 "within a day"        4770
## 5 "a few days or more"  1209
## 6 ""                      10
all <- all %>%
  mutate(host_response_time = if_else(host_response_time=="","unknown",host_response_time)) %>%
  mutate(host_response_time = if_else(host_response_time=="N/A","unknown",host_response_time))

We convert all empty strings and NA into one category called “unknown”.

5.2.4 property_type

Hmisc::describe(all$property_type)
## all$property_type 
##        n  missing distinct 
##    51663        0       35 
## 
## lowest : Aparthotel        Apartment         Barn              Bed and breakfast Boat             
## highest: Timeshare         Tiny house        Townhouse         Treehouse         Villa

No missing values. Very good!

unique(train$property_type)
##  [1] "Condominium"            "Apartment"              "Guest suite"           
##  [4] "House"                  "Hotel"                  "Townhouse"             
##  [7] "Guesthouse"             "Other"                  "Serviced apartment"    
## [10] "Boutique hotel"         "Loft"                   "Bed and breakfast"     
## [13] "Villa"                  "Resort"                 "Bungalow"              
## [16] "Hostel"                 "Cave"                   "Aparthotel"            
## [19] "Tiny house"             "Earth house"            "Cabin"                 
## [22] "Dome house"             "Camper/RV"              "Cottage"               
## [25] "Casa particular (Cuba)" "Boat"                   "Treehouse"             
## [28] "Tent"                   "Barn"                   "Houseboat"             
## [31] "Castle"                 "Island"                 "Nature lodge"
unique(test$property_type)
##  [1] "Apartment"          "Condominium"        "Loft"              
##  [4] "Townhouse"          "House"              "Boutique hotel"    
##  [7] "Guest suite"        "Tiny house"         "Bed and breakfast" 
## [10] "Guesthouse"         "Serviced apartment" "Cottage"           
## [13] "Villa"              "Castle"             "Hostel"            
## [16] "Other"              "Houseboat"          "Hotel"             
## [19] "Resort"             "Timeshare"          "Aparthotel"        
## [22] "Boat"               "Tent"               "Cave"              
## [25] "Camper/RV"          "Bungalow"           "Earth house"       
## [28] "Lighthouse"         "Dome house"

We can observe that there is some new levels in test set. Also, our property_type variable is high cardinality, which means it has many categories. Such variable is not suitable for one-hot encoding. One-hot encoding add too many new featuresto our model, causing curse of dimensionality. That is something we should prevent.

My approach is using mean-encoding to encode all levels. We do encoding on the train set, and apply it to the test set. If we encounter a new category. The Vtreat package handles everything for us. Also the Vtreat package use 5 folds cross validation to prevent overfitting.

First, we explore the average price for each property type. The red dot in each box represents its mean.

ggplot(data = train,
       aes(x = property_type, y = price)) +
geom_boxplot() + 
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
             geom = "point", shape = 18, size = 3,
             show.legend = FALSE) +
  coord_flip()

However, when We looking into the distribution of number of unique value, we found it was highly unbalanced.

The distribution is so skewed, and it hits me that we can convert all rare categories into one category called “Other”. This is a very common approach for handling unbalanced categorical feature.

all%>%
  count(property_type, sort=T)%>%
  ggplot(aes(reorder(property_type, n),n))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_text(size=6))+
  labs(x='Unique categories', y='Times appears in dataset')

all%>%
  count(property_type, sort=T)%>%
  mutate(total = sum(n),
         perc = n/total,
         cum_perc = cumsum(perc),
         nums = 1:n_distinct(property_type))%>%
  ggplot(aes(cum_perc, nums))+
  geom_line()+
  geom_vline(xintercept = .90, linetype='dashed')+
  labs(x='Cumulative Percentage',y='Number of unique factor levels', title='How many unique levels capture 90% of all observed levels?')+
  theme_minimal()

Wow, It seems that only 3 factors contributed to 90% of observed value.

One common approach for this issue is to collapse small categories to one category.

for (i in 1:nrow(all)) {
  if (all$property_type[i] %in% c("Apartment","Townhouse","House","Condominium","Loft","Guest suite","Service apartment","Boutique hotel","Hotel") == FALSE){
    all$property_type[i] = "Other"
                                  }
  
}

Examine the distribution again

all%>%
  count(property_type, sort=T)%>%
  ggplot(aes(reorder(property_type, n),n))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_text(size=6))+
  labs(x='Unique categories', y='Times appears in dataset')

5.2.5 room_type

ggplot(data = train,
       aes(x = room_type, y = price)) +
geom_boxplot() + 
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
             geom = "point", shape = 18, size = 3,
             show.legend = FALSE) +
  coord_flip()

all%>%
  count(room_type, sort=T)%>%
  ggplot(aes(reorder(room_type, n),n))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_text(size=10))+
  labs(x='Unique categories', y='Times appears in dataset')

5.2.6 bed_type

ggplot(data = train,
       aes(x = bed_type, y = price)) +
geom_boxplot() + 
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
             geom = "point", shape = 18, size = 3,
             show.legend = FALSE) +
  coord_flip()

all%>%
  count(bed_type, sort=T)%>%
  ggplot(aes(reorder(bed_type, n),n))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_text(size=10))+
  labs(x='Unique categories', y='Times appears in dataset')

Convert property/room/bed to factor, I would use one-hot encoding on these features later.

fc <- c("property_type","bed_type","room_type")
all <- all %>%
  mutate_at(fc,as.factor) 

5.2.7 monthly_price,square_feet,weekly_price,security_deposit

These four variables contains 1/3 missing values. We will exclude them from our dataset.

all <- all %>%
  select(!c("monthly_price","square_feet","weekly_price","security_deposit"))
which(colSums(is.na(all)) > 0)
##       host_listings_count host_total_listings_count                      beds 
##                         8                         9                        22 
##                     price              cleaning_fee         reviews_per_month 
##                        25                        26                        61

5.2.8 impute with 0

all <- all %>% 
  mutate_at(c("cleaning_fee","beds","host_listings_count","host_total_listings_count",
              "reviews_per_month"), 
            ~replace(., is.na(.),0)) 

Now we can see in numeric variables there is only price has missing value.

which(colSums(is.na(all)) > 0)
## price 
##    25

5.2.9 review_scores_rating

Review scores rating is on 1-100 scale, we can convert it to 1-10 scale.

all <- all %>%
  mutate(review_scores_rating = review_scores_rating/10) 

5.2.10 cancellation_policy

all <- all %>%
  mutate_at("cancellation_policy",as.factor) 

5.2.11 zipcode

We have ZIP code of houses, and the overall quality of Zip code data looks good (some zip codes need to be cleaned).

I am thinking to use Zip code and its longitude and latitude, we could have geographic variables. To my knowledge, this two variables can be very helpful when we use tree-based model to train our data. Therefore, we need another dataset which contains Zip code info, longitude and latitude info. We can join data_drop table with ZIP code Info table to get the result.

Let’s first clean our Zip code variable.

Notice that our Zip codes should be 5 digits (houses are located in NY). Check how many digits our ZIP codes have.

unique(nchar(all$zipcode))
## [1]  5  0  8  6 10  1  4 11

“6” means there are some ZIP codes contain spaces, let’s remove them.

library(stringr)
all$zipcode <- str_remove_all(all$zipcode," ")

We have 4 digits Zip code because some houses has the wrong Zip code, we should remove them from our data. We search the zip code of East Village, the corret zipcode should be 10009.

all[nchar(all$zipcode)==4,c("zipcode","neighbourhood")]
##       zipcode neighbourhood
## 38058    1009  East Village
all[nchar(all$zipcode)==4,c("zipcode")] <- "10009"

Some Zip codes contain character “NY”! That’s why we have 8 digits Zip code.

We just want the numbers, we don’t want “NY” to show up in our Zip code!

library(stringr)
all$zipcode = str_replace_all(all$zipcode,"NY","")
unique(nchar(all$zipcode))
## [1]  5  0 10 11

It Looks like we successfully remove “NY”.

We have 10 digits Zip code because Some ZIP codes includes the street number, we should exclude street character. For example:

all$zipcode<- sapply(all$zipcode, substring, 1, 5)
unique(nchar(all$zipcode))
## [1] 5 0

There are some empty strings in Zip code variables,How to impute them? I am going to use neighbourhood to impute zipcode

all[all$zipcode =="",c("zipcode","neighbourhood")] %>%
  arrange(neighbourhood)
##     zipcode       neighbourhood
## 1                 Alphabet City
## 2                 Alphabet City
## 3                 Alphabet City
## 4                 Alphabet City
## 5                 Alphabet City
## 6                      Arrochar
## 7                       Astoria
## 8                       Astoria
## 9                       Astoria
## 10                      Astoria
## 11                      Astoria
## 12                      Astoria
## 13            Battery Park City
## 14           Bedford-Stuyvesant
## 15           Bedford-Stuyvesant
## 16           Bedford-Stuyvesant
## 17           Bedford-Stuyvesant
## 18           Bedford-Stuyvesant
## 19           Bedford-Stuyvesant
## 20           Bedford-Stuyvesant
## 21           Bedford-Stuyvesant
## 22           Bedford-Stuyvesant
## 23           Bedford-Stuyvesant
## 24           Bedford-Stuyvesant
## 25           Bedford-Stuyvesant
## 26           Bedford-Stuyvesant
## 27           Bedford-Stuyvesant
## 28           Bedford-Stuyvesant
## 29           Bedford-Stuyvesant
## 30           Bedford-Stuyvesant
## 31           Bedford-Stuyvesant
## 32           Bedford-Stuyvesant
## 33           Bedford-Stuyvesant
## 34           Bedford-Stuyvesant
## 35                  Boerum Hill
## 36                  Boerum Hill
## 37                     Brooklyn
## 38                     Brooklyn
## 39                     Brooklyn
## 40                     Brooklyn
## 41                     Brooklyn
## 42                     Brooklyn
## 43                     Brooklyn
## 44                     Brooklyn
## 45                     Brooklyn
## 46                     Brooklyn
## 47                     Brooklyn
## 48                     Brooklyn
## 49                     Brooklyn
## 50                     Brooklyn
## 51                     Brooklyn
## 52                     Brooklyn
## 53                     Brooklyn
## 54                     Brooklyn
## 55                     Brooklyn
## 56                     Brooklyn
## 57                     Brooklyn
## 58                     Brooklyn
## 59                     Brooklyn
## 60                     Brooklyn
## 61                     Brooklyn
## 62                     Brooklyn
## 63                     Brooklyn
## 64                     Brooklyn
## 65                     Brooklyn
## 66                     Brooklyn
## 67                     Brooklyn
## 68                     Brooklyn
## 69                     Brooklyn
## 70                     Brooklyn
## 71                     Brooklyn
## 72                     Brooklyn
## 73                     Brooklyn
## 74                     Brooklyn
## 75                     Brooklyn
## 76                     Brooklyn
## 77                     Brooklyn
## 78                     Brooklyn
## 79                     Brooklyn
## 80                     Brooklyn
## 81                     Brooklyn
## 82                     Brooklyn
## 83                     Brooklyn
## 84                     Brooklyn
## 85                     Brooklyn
## 86                     Brooklyn
## 87                     Brooklyn
## 88                     Brooklyn
## 89                     Brooklyn
## 90                     Brooklyn
## 91                     Brooklyn
## 92                     Brooklyn
## 93                     Brooklyn
## 94                     Brooklyn
## 95                     Brooklyn
## 96                     Brooklyn
## 97                     Brooklyn
## 98                     Brooklyn
## 99                     Brooklyn
## 100                    Brooklyn
## 101                    Brooklyn
## 102                    Brooklyn
## 103                    Brooklyn
## 104                    Brooklyn
## 105                    Brooklyn
## 106                    Brooklyn
## 107                    Brooklyn
## 108                    Brooklyn
## 109                    Brooklyn
## 110                    Brooklyn
## 111                    Brooklyn
## 112                    Brooklyn
## 113                    Brooklyn
## 114                    Brooklyn
## 115                    Brooklyn
## 116                    Brooklyn
## 117                    Brooklyn
## 118                    Brooklyn
## 119                    Brooklyn
## 120                    Brooklyn
## 121                    Brooklyn
## 122                    Brooklyn
## 123                    Brooklyn
## 124                    Brooklyn
## 125                    Brooklyn
## 126                    Brooklyn
## 127                    Brooklyn
## 128                    Brooklyn
## 129                    Brooklyn
## 130                    Bushwick
## 131                    Bushwick
## 132                    Bushwick
## 133                    Bushwick
## 134                    Bushwick
## 135                    Bushwick
## 136                    Bushwick
## 137                    Bushwick
## 138                    Bushwick
## 139                    Bushwick
## 140                     Chelsea
## 141                     Chelsea
## 142                     Chelsea
## 143                     Chelsea
## 144                     Chelsea
## 145                     Chelsea
## 146                     Chelsea
## 147                     Chelsea
## 148                     Chelsea
## 149                     Chelsea
## 150                   Chinatown
## 151                   Chinatown
## 152                   Chinatown
## 153                Clinton Hill
## 154                Clinton Hill
## 155                Clinton Hill
## 156                Clinton Hill
## 157                   Concourse
## 158                     Crotona
## 159               Crown Heights
## 160               Crown Heights
## 161               Crown Heights
## 162               Crown Heights
## 163           Downtown Brooklyn
## 164           Downtown Brooklyn
## 165               East Elmhurst
## 166               East Flatbush
## 167                 East Harlem
## 168                 East Harlem
## 169                 East Harlem
## 170                 East Harlem
## 171               East New York
## 172               East New York
## 173               East New York
## 174               East New York
## 175                East Village
## 176                East Village
## 177                East Village
## 178                East Village
## 179                East Village
## 180                East Village
## 181                East Village
## 182                East Village
## 183                East Village
## 184                East Village
## 185                    Elmhurst
## 186          Financial District
## 187          Financial District
## 188          Financial District
## 189          Financial District
## 190          Financial District
## 191                    Flatbush
## 192                    Flatbush
## 193                    Flatbush
## 194           Flatiron District
## 195           Flatiron District
## 196                    Flushing
## 197                    Flushing
## 198                 Fort Greene
## 199               Gramercy Park
## 200                   Gravesend
## 201                  Greenpoint
## 202                  Greenpoint
## 203                  Greenpoint
## 204                 Grymes Hill
## 205            Hamilton Heights
## 206            Hamilton Heights
## 207            Hamilton Heights
## 208            Hamilton Heights
## 209            Hamilton Heights
## 210                      Harlem
## 211                      Harlem
## 212                      Harlem
## 213                      Harlem
## 214                      Harlem
## 215                      Harlem
## 216                      Harlem
## 217                      Harlem
## 218                      Harlem
## 219                      Harlem
## 220                      Harlem
## 221                      Harlem
## 222                      Harlem
## 223                      Harlem
## 224                      Harlem
## 225              Hell's Kitchen
## 226              Hell's Kitchen
## 227              Hell's Kitchen
## 228              Hell's Kitchen
## 229              Hell's Kitchen
## 230                Howard Beach
## 231                      Inwood
## 232            Kew Garden Hills
## 233                    Kips Bay
## 234                    Kips Bay
## 235             Lefferts Garden
## 236             Lefferts Garden
## 237             Lefferts Garden
## 238            Long Island City
## 239             Lower East Side
## 240             Lower East Side
## 241             Lower East Side
## 242             Lower East Side
## 243             Lower East Side
## 244             Lower East Side
## 245             Lower East Side
## 246             Lower East Side
## 247                   Manhattan
## 248                   Manhattan
## 249                   Manhattan
## 250                   Manhattan
## 251                   Manhattan
## 252                   Manhattan
## 253                   Manhattan
## 254                   Manhattan
## 255                   Manhattan
## 256                   Manhattan
## 257                   Manhattan
## 258                   Manhattan
## 259                   Manhattan
## 260                   Manhattan
## 261                   Manhattan
## 262                   Manhattan
## 263                   Manhattan
## 264                   Manhattan
## 265                   Manhattan
## 266                   Manhattan
## 267                   Manhattan
## 268                   Manhattan
## 269                   Manhattan
## 270                   Manhattan
## 271                   Manhattan
## 272                   Manhattan
## 273                   Manhattan
## 274                   Manhattan
## 275                   Manhattan
## 276                   Manhattan
## 277                   Manhattan
## 278                   Manhattan
## 279                   Manhattan
## 280                   Manhattan
## 281                   Manhattan
## 282                   Manhattan
## 283                   Manhattan
## 284                   Manhattan
## 285                   Manhattan
## 286                   Manhattan
## 287                   Manhattan
## 288                   Manhattan
## 289                   Manhattan
## 290                   Manhattan
## 291                   Manhattan
## 292                   Manhattan
## 293                   Manhattan
## 294                   Manhattan
## 295                   Manhattan
## 296                   Manhattan
## 297                   Manhattan
## 298                   Manhattan
## 299                   Manhattan
## 300                   Manhattan
## 301                   Manhattan
## 302                   Manhattan
## 303                   Manhattan
## 304                   Manhattan
## 305                   Manhattan
## 306                   Manhattan
## 307                   Manhattan
## 308                   Manhattan
## 309                   Manhattan
## 310                   Manhattan
## 311                   Manhattan
## 312                   Manhattan
## 313                   Manhattan
## 314                   Manhattan
## 315                   Manhattan
## 316                   Manhattan
## 317                   Manhattan
## 318                   Manhattan
## 319                   Manhattan
## 320                   Manhattan
## 321                   Manhattan
## 322                   Manhattan
## 323                   Manhattan
## 324                   Manhattan
## 325                   Manhattan
## 326                   Manhattan
## 327                   Manhattan
## 328                   Manhattan
## 329                   Manhattan
## 330                   Manhattan
## 331                   Manhattan
## 332                   Manhattan
## 333                   Manhattan
## 334                   Manhattan
## 335                   Manhattan
## 336                   Manhattan
## 337                   Manhattan
## 338                   Manhattan
## 339                   Manhattan
## 340                   Manhattan
## 341                   Manhattan
## 342                   Manhattan
## 343                   Manhattan
## 344                   Manhattan
## 345                   Manhattan
## 346                   Manhattan
## 347                     Maspeth
## 348                     Midtown
## 349                     Midtown
## 350                     Midtown
## 351                     Midtown
## 352                     Midtown
## 353                     Midtown
## 354                     Midtown
## 355                     Midtown
## 356                     Midtown
## 357                     Midtown
## 358                     Midtown
## 359                     Midtown
## 360                     Midtown
## 361                Midtown East
## 362                Midtown East
## 363                Midtown East
## 364                     Midwood
## 365         Morningside Heights
## 366                  Mott Haven
## 367                  Park Slope
## 368            Prospect Heights
## 369                      Queens
## 370                      Queens
## 371                      Queens
## 372                      Queens
## 373                      Queens
## 374                      Queens
## 375                      Queens
## 376                      Queens
## 377                      Queens
## 378                      Queens
## 379                      Queens
## 380                      Queens
## 381                      Queens
## 382                      Queens
## 383                      Queens
## 384                      Queens
## 385                      Queens
## 386                      Queens
## 387                      Queens
## 388                      Queens
## 389                      Queens
## 390                      Queens
## 391                   Rego Park
## 392               Richmond Hill
## 393                   Ridgewood
## 394            South Ozone Park
## 395                   Sunnyside
## 396                   Sunnyside
## 397                   Sunnyside
## 398                   Sunnyside
## 399                 Sunset Park
## 400                 Sunset Park
## 401                   The Bronx
## 402                   The Bronx
## 403                   The Bronx
## 404                   The Bronx
## 405                   The Bronx
## 406             Upper East Side
## 407             Upper East Side
## 408             Upper East Side
## 409             Upper East Side
## 410             Upper East Side
## 411             Upper East Side
## 412             Upper East Side
## 413             Upper East Side
## 414             Upper East Side
## 415             Upper East Side
## 416             Upper East Side
## 417             Upper West Side
## 418             Upper West Side
## 419             Upper West Side
## 420          Washington Heights
## 421          Washington Heights
## 422          Washington Heights
## 423                West Village
## 424                West Village
## 425                West Village
## 426                 Westerleigh
## 427                  Whitestone
## 428                Williamsburg
## 429                Williamsburg
## 430                Williamsburg
## 431                Williamsburg
## 432                Williamsburg
## 433                Williamsburg
## 434                Williamsburg
## 435                Williamsburg
## 436                Williamsburg
## 437                Williamsburg
## 438                Williamsburg
## 439                Williamsburg
## 440                Williamsburg
## 441                Williamsburg
## 442                Williamsburg
## 443                Williamsburg
all[all$neighbourhood =="",c("zipcode","neighbourhood")]
##       zipcode neighbourhood
## 1956    11001              
## 2537    11363              
## 2823    10031              
## 3821    11363              
## 3826    10002              
## 4406    11237              
## 4441    11001              
## 4711    11211              
## 4862    10453              
## 5376    11222              
## 7087    11213              
## 7273    11362              
## 8182    11219              
## 8831    11211              
## 9001    11231              
## 9195    10705              
## 10107   11211              
## 10268   10002              
## 11362   10016              
## 11639   10019              
## 11648   11355              
## 12353   11385              
## 12357   11385              
## 12637   10469              
## 12744   11362              
## 13354   11211              
## 13748   11105              
## 14334   10455              
## 14662   10005              
## 16204   11004              
## 16790   10467              
## 20036   11369              
## 21393   11362              
## 21593   11362              
## 22141   10031              
## 22930   11363              
## 22959   10016              
## 23438   11369              
## 23897   11362              
## 24010   10010              
## 24154   11211              
## 26113   10028              
## 26291   11221              
## 26926   10030              
## 27259   11234              
## 28633   11362              
## 30910   11211              
## 32446   11221              
## 32849   10019              
## 33061   10033              
## 33081   11237              
## 33576   10036              
## 33974   10032              
## 36113   11211              
## 36520   11372              
## 36556   10035              
## 37112   10038              
## 37128   11370              
## 37141   11237              
## 37584   10456              
## 39027   10009              
## 51626   11102              
## 51627   11206              
## 51628   10030              
## 51629   10033              
## 51630   11211              
## 51631   11237              
## 51632   11355              
## 51633   11211              
## 51634   10309              
## 51635   11225              
## 51636   10002              
## 51637   11101              
## 51638   10038              
## 51639   11216              
## 51640   11234              
## 51641   11692              
## 51642   10013
all[all$zipcode ==""& all$neighbourhood=="",]
##  [1] id                                          
##  [2] host_since                                  
##  [3] host_response_time                          
##  [4] host_response_rate                          
##  [5] host_acceptance_rate                        
##  [6] host_is_superhost                           
##  [7] host_neighbourhood                          
##  [8] host_listings_count                         
##  [9] host_total_listings_count                   
## [10] host_verifications                          
## [11] host_identity_verified                      
## [12] neighbourhood                               
## [13] neighbourhood_cleansed                      
## [14] neighbourhood_group_cleansed                
## [15] zipcode                                     
## [16] is_location_exact                           
## [17] property_type                               
## [18] room_type                                   
## [19] accommodates                                
## [20] bathrooms                                   
## [21] bedrooms                                    
## [22] beds                                        
## [23] bed_type                                    
## [24] amenities                                   
## [25] price                                       
## [26] cleaning_fee                                
## [27] guests_included                             
## [28] extra_people                                
## [29] minimum_nights                              
## [30] maximum_nights                              
## [31] minimum_minimum_nights                      
## [32] maximum_minimum_nights                      
## [33] minimum_maximum_nights                      
## [34] maximum_maximum_nights                      
## [35] minimum_nights_avg_ntm                      
## [36] maximum_nights_avg_ntm                      
## [37] calendar_updated                            
## [38] availability_30                             
## [39] availability_60                             
## [40] availability_90                             
## [41] availability_365                            
## [42] number_of_reviews                           
## [43] number_of_reviews_ltm                       
## [44] first_review                                
## [45] last_review                                 
## [46] review_scores_rating                        
## [47] review_scores_accuracy                      
## [48] review_scores_cleanliness                   
## [49] review_scores_checkin                       
## [50] review_scores_communication                 
## [51] review_scores_location                      
## [52] review_scores_value                         
## [53] instant_bookable                            
## [54] cancellation_policy                         
## [55] require_guest_profile_picture               
## [56] require_guest_phone_verification            
## [57] calculated_host_listings_count              
## [58] calculated_host_listings_count_entire_homes 
## [59] calculated_host_listings_count_private_rooms
## [60] calculated_host_listings_count_shared_rooms 
## [61] reviews_per_month                           
## <0 rows> (or 0-length row.names)
dim(all)
## [1] 51663    61

So we can see that some missing zip codes has neighbourhood information, so I will try to impute missing zipcode with the highest frequency zipcode in its neighbourhood. (I have no idea if it is the correct way to do it)

zipcode_max <- all %>%
  count(neighbourhood_cleansed,zipcode) %>%
  group_by(neighbourhood_cleansed) %>%
  filter(n == max(n))
as.data.frame(zipcode_max)
##         neighbourhood_cleansed zipcode    n
## 1                     Allerton   10469   28
## 2                Arden Heights   10312    4
## 3                     Arrochar   10305   24
## 4                      Arverne   11692   94
## 5                      Astoria   11103  398
## 6                   Bath Beach   11214   23
## 7            Battery Park City   10280   29
## 8                    Bay Ridge   11209  136
## 9                  Bay Terrace   11360    5
## 10  Bay Terrace, Staten Island   10308    2
## 11                  Baychester   10469    7
## 12                     Bayside   11361   29
## 13                   Bayswater   11691   19
## 14          Bedford-Stuyvesant   11221 1164
## 15                Belle Harbor   11694    6
## 16                   Bellerose   11426    6
## 17                     Belmont   10458   36
## 18                 Bensonhurst   11214   32
## 19                Bergen Beach   11234   12
## 20                 Boerum Hill   11217  115
## 21                Borough Park   11219   95
## 22                Breezy Point   11694    2
## 23                   Briarwood   11435   43
## 24              Brighton Beach   11235   81
## 25                   Bronxdale   10462   21
## 26            Brooklyn Heights   11201  149
## 27                 Brownsville   11212   54
## 28                 Bull's Head   10314    4
## 29                    Bushwick   11237  957
## 30             Cambria Heights   11411   23
## 31                    Canarsie   11236  208
## 32             Carroll Gardens   11231  214
## 33                 Castle Hill   10473    4
## 34           Castleton Corners   10314    2
## 35                     Chelsea   10011  683
## 36                   Chinatown   10002  333
## 37                 City Island   10464   18
## 38                Civic Center   10038   24
## 39           Claremont Village   10456   15
## 40                Clason Point   10473   23
## 41                     Clifton   10304   21
## 42                Clinton Hill   11238  355
## 43                  Co-op City   10475    4
## 44                 Cobble Hill   11201   63
## 45               College Point   11356   25
## 46                 Columbia St   11231   37
## 47                     Concord   10305   14
## 48                   Concourse   10451   28
## 49           Concourse Village   10456   18
## 50                Coney Island   11224   20
## 51                      Corona   11368   89
## 52                Country Club   10465    2
## 53               Crown Heights   11213  548
## 54               Cypress Hills   11208  101
## 55            Ditmars Steinway   11105  263
## 56                Dongan Hills   10306    4
## 57                  Douglaston   11362    7
## 58           Downtown Brooklyn   11201   70
## 59                       DUMBO   11201   30
## 60               Dyker Heights   11228   13
## 61               East Elmhurst   11369  177
## 62               East Flatbush   11203  277
## 63                 East Harlem   10029  796
## 64             East Morrisania   10459    9
## 65               East New York   11207  181
## 66                East Village   10009 1137
## 67                 Eastchester   10475   12
## 68                    Edenwald   10466   16
## 69                    Edgemere   11691    9
## 70                    Elmhurst   11373  288
## 71                 Eltingville   10308    1
## 72                 Eltingville   10312    1
## 73                Emerson Hill   10314    4
## 74                Far Rockaway   11691   31
## 75                   Fieldston   10471    8
## 76          Financial District   10005  359
## 77                    Flatbush   11226  485
## 78           Flatiron District   10010   65
## 79                   Flatlands   11234   67
## 80                    Flushing   11355  304
## 81                     Fordham   10458   25
## 82                     Fordham   10468   25
## 83                Forest Hills   11375  140
## 84                 Fort Greene   11205  207
## 85               Fort Hamilton   11209   57
## 86               Fresh Meadows   11365   29
## 87                   Glen Oaks   11004    1
## 88                    Glendale   11385   73
## 89                     Gowanus   11215  125
## 90                    Gramercy   10003  232
## 91                Graniteville   10303    4
## 92                  Grant City   10306    8
## 93                   Gravesend   11223   71
## 94                 Great Kills   10308   12
## 95                  Greenpoint   11222 1023
## 96           Greenwich Village   10012  226
## 97                 Grymes Hill   10304    6
## 98                      Harlem   10031  844
## 99              Hell's Kitchen   10019 1015
## 100                 Highbridge   10452   32
## 101                     Hollis   11423   15
## 102                 Holliswood   11423    2
## 103               Howard Beach   11414   24
## 104               Howland Hook   10303    2
## 105                   Huguenot   10312    3
## 106                Hunts Point   10474   24
## 107                     Inwood   10034  168
## 108            Jackson Heights   11372  168
## 109                    Jamaica   11434  170
## 110            Jamaica Estates   11432   17
## 111              Jamaica Hills   11432   14
## 112                 Kensington   11218  166
## 113                Kew Gardens   11415   31
## 114          Kew Gardens Hills   11367   23
## 115                Kingsbridge   10463   43
## 116                   Kips Bay   10016  299
## 117                  Laurelton   11413   35
## 118            Lighthouse Hill   10306    2
## 119               Little Italy   10013  152
## 120                Little Neck   11363    2
## 121           Long Island City   11101  396
## 122                   Longwood   10456   22
## 123                   Longwood   10459   22
## 124            Lower East Side   10002  944
## 125            Manhattan Beach   11235   10
## 126                Marble Hill   10463   11
## 127            Mariners Harbor   10303   11
## 128                    Maspeth   11378   77
## 129                    Melrose   10451   10
## 130             Middle Village   11379   39
## 131              Midland Beach   10306    8
## 132                    Midtown   10022  411
## 133                    Midwood   11230   78
## 134                 Mill Basin   11234   11
## 135        Morningside Heights   10025  137
## 136             Morris Heights   10453   19
## 137                Morris Park   10461   12
## 138                 Morrisania   10456    7
## 139                 Mott Haven   10454   43
## 140                 Mount Eden   10452    6
## 141                 Mount Hope   10457   20
## 142                Murray Hill   10016  400
## 143                  Navy Yard   11205    9
## 144                   Neponsit   11694    3
## 145               New Brighton   10301    5
## 146             New Dorp Beach   10306    4
## 147            New Springville   10314    7
## 148                       NoHo   10012   42
## 149                     Nolita   10012  240
## 150            North Riverdale   10705    9
## 151                    Norwood   10467   26
## 152                    Oakwood   10306    7
## 153                  Olinville   10467    5
## 154                 Ozone Park   11417   44
## 155                 Park Slope   11215  296
## 156                Parkchester   10462   24
## 157                 Pelham Bay   10461   15
## 158             Pelham Gardens   10469   31
## 159                Port Morris   10454   26
## 160              Port Richmond   10302   10
## 161               Prince's Bay   10309    1
## 162               Prince's Bay   10312    1
## 163           Prospect Heights   11238  302
## 164  Prospect-Lefferts Gardens   11225  396
## 165             Queens Village   11429   44
## 166              Randall Manor   10310   18
## 167                   Red Hook   11231   90
## 168                  Rego Park   11374  108
## 169              Richmond Hill   11418   53
## 170               Richmondtown   10306    2
## 171                  Ridgewood   11385  413
## 172                  Riverdale   10463    7
## 173             Rockaway Beach   11693   71
## 174           Roosevelt Island   10044   82
## 175                   Rosebank   10305   10
## 176                   Rosedale   11422   83
## 177                  Rossville   10309    3
## 178              Schuylerville   10461    8
## 179                   Sea Gate   11224    7
## 180             Sheepshead Bay   11229   91
## 181                Shore Acres   10305    8
## 182                Silver Lake   10301    2
## 183                       SoHo   10012  186
## 184                  Soundview   10472   11
## 185                South Beach   10305    8
## 186           South Ozone Park   11420   26
## 187                South Slope   11215  287
## 188        Springfield Gardens   11413   59
## 189             Spuyten Duyvil   10463    5
## 190                 St. Albans   11412   53
## 191                 St. George   10301   53
## 192                  Stapleton   10301   21
## 193            Stuyvesant Town   10009   34
## 194                  Sunnyside   11104  242
## 195                Sunset Park   11220  188
## 196           Theater District   10036  125
## 197                Throgs Neck   10465   36
## 198                  Todt Hill   10314    2
## 199              Tompkinsville   10301   37
## 200                Tottenville   10307    4
## 201                    Tremont   10457   13
## 202                    Tribeca   10013   97
## 203                Two Bridges   10002   68
## 204                  Unionport   10472    6
## 205         University Heights   10453   17
## 206            Upper East Side   10128  416
## 207            Upper West Side   10025  845
## 208                   Van Nest   10462   10
## 209               Vinegar Hill   11201   31
## 210                  Wakefield   10466   60
## 211         Washington Heights   10032  520
## 212              West Brighton   10310   23
## 213                 West Farms   10460    3
## 214               West Village   10014  688
## 215         Westchester Square   10461    9
## 216                Westerleigh   10314    7
## 217                 Whitestone   11357   17
## 218             Williamsbridge   10469   26
## 219               Williamsburg   11211 2223
## 220                Willowbrook   10314    1
## 221            Windsor Terrace   11218   91
## 222                  Woodhaven   11421  105
## 223                   Woodlawn   10470    8
## 224                   Woodside   11377  259

impute all the missing zipcode

for (i in 1:nrow(all)){
        if(all$zipcode[i]==""){
              all$zipcode[i]= pull(zipcode_max[zipcode_max$neighbourhood_cleansed == all$neighbourhood_cleansed[i],"zipcode"])
        }
}
dim(all)
## [1] 51663    61
sum(all$zipcode == "")
## [1] 0
Hmisc::describe(all$zipcode)
## all$zipcode 
##        n  missing distinct 
##    51663        0      191 
## 
## lowest : 10001 10002 10003 10004 10005, highest: 11691 11692 11693 11694 13355
dim(all)
## [1] 51663    61
all%>%
  count(zipcode, sort=T)%>%
  ggplot(aes(reorder(zipcode, n),n))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_text(size=2))+
  labs(x='Unique categories', y='Times appears in dataset')

all %>%
  group_by(zipcode) %>%
  count() %>%
  arrange(desc(n))
## # A tibble: 191 x 2
## # Groups:   zipcode [191]
##    zipcode     n
##    <chr>   <int>
##  1 11211    2302
##  2 11221    2104
##  3 11206    1529
##  4 11216    1439
##  5 10002    1411
##  6 10019    1350
##  7 10009    1204
##  8 11238    1150
##  9 11233    1126
## 10 11222    1106
## # … with 181 more rows

Mean-encode our zipcode

dTrainN <- train
# NOT RUN {
# numeric example
set.seed(23525)
# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsN
# and dTrainNTreated
unpack[
  treatmentsN = treatments,
  dTrainNTreated = crossFrame
  ] <- mkCrossFrameNExperiment(
    dframe = dTrainN,
    varlist = setdiff("zipcode", 'price'),
    outcomename = 'price',
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)
##    origName             varName code          rsq          sig
## 1   zipcode        zipcode_catP catP 1.855895e-03 1.919999e-18
## 2   zipcode        zipcode_catN catN 1.902765e-01 0.000000e+00
## 3   zipcode        zipcode_catD catD 1.514050e-01 0.000000e+00
## 4   zipcode zipcode_lev_x_10002  lev 1.463277e-03 7.291330e-15
## 5   zipcode zipcode_lev_x_10009  lev 1.959967e-03 2.167616e-19
## 6   zipcode zipcode_lev_x_10019  lev 7.593893e-03 1.748099e-70
## 7   zipcode zipcode_lev_x_11206  lev 3.984955e-03 9.068693e-38
## 8   zipcode zipcode_lev_x_11211  lev 8.112251e-04 6.983097e-09
## 9   zipcode zipcode_lev_x_11216  lev 1.552024e-03 1.129175e-15
## 10  zipcode zipcode_lev_x_11221  lev 5.878451e-03 6.266125e-55
## 11  zipcode zipcode_lev_x_11222  lev 5.018784e-07 8.854857e-01
## 12  zipcode zipcode_lev_x_11233  lev 2.541367e-03 1.125461e-24
## 13  zipcode zipcode_lev_x_11237  lev 4.820195e-03 2.442631e-45
## 14  zipcode zipcode_lev_x_11238  lev 1.391928e-05 4.481790e-01
##    extraModelDegrees
## 1                208
## 2                208
## 3                208
## 4                  0
## 5                  0
## 6                  0
## 7                  0
## 8                  0
## 9                  0
## 10                 0
## 11                 0
## 12                 0
## 13                 0
## 14                 0
# the treated frame is a "cross frame" which
# is a transform of the training data built 
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainNTreated %.>%
  head(.) %.>%
  print(.)
##   zipcode_catP zipcode_catN zipcode_catD zipcode_lev_x_10002
## 1  0.012811207   -59.044505     47.52400                   0
## 2  0.021957682    -2.906385    104.05159                   0
## 3  0.021013283    -1.465574    103.63760                   0
## 4  0.007367619    91.877609    189.38397                   0
## 5  0.004573005   -54.118285     42.04390                   0
## 6  0.007295031   -40.282820     57.64091                   0
##   zipcode_lev_x_10009 zipcode_lev_x_10019 zipcode_lev_x_11206
## 1                   0                   0                   0
## 2                   0                   0                   0
## 3                   0                   0                   0
## 4                   0                   0                   0
## 5                   0                   0                   0
## 6                   0                   0                   0
##   zipcode_lev_x_11211 zipcode_lev_x_11216 zipcode_lev_x_11221
## 1                   0                   0                   0
## 2                   0                   0                   0
## 3                   0                   0                   0
## 4                   0                   0                   0
## 5                   0                   0                   0
## 6                   0                   0                   0
##   zipcode_lev_x_11222 zipcode_lev_x_11233 zipcode_lev_x_11237
## 1                   0                   0                   0
## 2                   0                   0                   0
## 3                   0                   0                   0
## 4                   0                   0                   0
## 5                   0                   0                   0
## 6                   0                   0                   0
##   zipcode_lev_x_11238 price
## 1                   0   145
## 2                   1   199
## 3                   1   165
## 4                   0    59
## 5                   0    48
## 6                   0    80
# Any future application data is prepared with
# the prepare method.
dTestNTreated <- prepare(treatmentsN, test, pruneSig=NULL)

dTestNTreated %.>%
  head(.) %.>%
  print(.)
##   zipcode_catP zipcode_catN zipcode_catD zipcode_lev_x_10002
## 1   0.01541253    -13.76147     89.82499                   0
## 2   0.01613840    -39.19542     76.89985                   0
## 3   0.02298572     32.05681    109.19339                   0
## 4   0.01696105     13.44983     94.44524                   0
## 5   0.01592064     35.58320    114.96631                   0
## 6   0.01613840    -39.19542     76.89985                   0
##   zipcode_lev_x_10009 zipcode_lev_x_10019 zipcode_lev_x_11206
## 1                   0                   0                   0
## 2                   0                   0                   0
## 3                   1                   0                   0
## 4                   0                   0                   0
## 5                   0                   0                   0
## 6                   0                   0                   0
##   zipcode_lev_x_11211 zipcode_lev_x_11216 zipcode_lev_x_11221
## 1                   0                   0                   0
## 2                   0                   0                   0
## 3                   0                   0                   0
## 4                   0                   0                   0
## 5                   0                   0                   0
## 6                   0                   0                   0
##   zipcode_lev_x_11222 zipcode_lev_x_11233 zipcode_lev_x_11237
## 1                   0                   0                   0
## 2                   0                   0                   0
## 3                   0                   0                   0
## 4                   0                   0                   0
## 5                   0                   0                   0
## 6                   0                   0                   0
##   zipcode_lev_x_11238 price
## 1                   0    NA
## 2                   0    NA
## 3                   0    NA
## 4                   0    NA
## 5                   0    NA
## 6                   0    NA
# }
all[1:41330,"zipcode_m"] <- dTrainNTreated$zipcode_catN
all[41331:51663,"zipcode_m"]<- dTestNTreated$zipcode_catN
all[,c("zipcode_m")] <- sapply(all[,c("zipcode_m")],as.numeric)
head(all$zipcode_m)
## [1] -59.044505  -2.906385  -1.465574  91.877609 -54.118285 -40.282820
ggplot(all,aes(zipcode_m))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(all,aes(zipcode_m,price))+
  geom_point()
## Warning: Removed 10333 rows containing missing values (geom_point).

5.2.12 host_neighbourhood, neighbourhood_cleansed& group_cleansed

We can see that our neighbourhood variable has so many levels. First thought is using one-hot encoding, but this measure could lead to very large dimension in our dataset, causing multicollinearity and curse of dimension. Also, I would use impact encoding of Vtreat package instead of collapsing rare categories. I experiment two approaches and the impact encoding gave me a lower RMSE on local test cv score.

Hmisc::describe(all$host_neighbourhood)
## all$host_neighbourhood 
##        n  missing distinct 
##    44532     7131      439 
## 
## lowest : 7th Arrondissement Abbotsford         Adams Point        Alésia             Algirós           
## highest: X Arrondissement   XI Arrondissement  Yehuda Hamaccabi   Yorkville          Zona 8
all%>%
  count(host_neighbourhood, sort=T)%>%
  ggplot(aes(reorder(host_neighbourhood, n),n))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_text(size=2))+
  labs(x='Unique categories', y='Times appears in dataset')

Hmisc::describe(all$neighbourhood)
## all$neighbourhood 
##        n  missing distinct 
##    51585       78      202 
## 
## lowest : Allerton        Alphabet City   Annadale        Arden Heights   Arrochar       
## highest: Willowbrook     Windsor Terrace Woodhaven       Woodlawn        Woodside
Hmisc::describe(all$neighbourhood_cleansed)
## all$neighbourhood_cleansed 
##        n  missing distinct 
##    51663        0      220 
## 
## lowest : Allerton        Arden Heights   Arrochar        Arverne         Astoria        
## highest: Willowbrook     Windsor Terrace Woodhaven       Woodlawn        Woodside

Distribution of neighbourhood_cleansed

all%>%
  count(neighbourhood_cleansed, sort=T)%>%
  ggplot(aes(reorder(neighbourhood_cleansed, n),n))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_text(size=2))+
  labs(x='Unique categories', y='Times appears in dataset')

all%>%
  count(neighbourhood_cleansed, sort=T)%>%
  mutate(total = sum(n),
         perc = n/total,
         cum_perc = cumsum(perc),
         nums = 1:n_distinct(neighbourhood_cleansed))%>%
  ggplot(aes(cum_perc, nums))+
  geom_line()+
  geom_vline(xintercept = .80, linetype='dashed')+
  labs(x='Cumulative Percentage',y='Number of unique factor levels', title='How many unique levels capture 80% of all observed levels?')+
  theme_minimal()

Hmisc::describe(all$neighbourhood_group_cleansed)
## all$neighbourhood_group_cleansed 
##        n  missing distinct 
##    51663        0        5 
## 
## lowest : Bronx         Brooklyn      Manhattan     Queens        Staten Island
## highest: Bronx         Brooklyn      Manhattan     Queens        Staten Island
##                                                                   
## Value              Bronx      Brooklyn     Manhattan        Queens
## Frequency           1289         21236         22138          6576
## Proportion         0.025         0.411         0.429         0.127
##                         
## Value      Staten Island
## Frequency            424
## Proportion         0.008
all%>%
  count(neighbourhood_group_cleansed, sort=T)%>%
  ggplot(aes(reorder(neighbourhood_group_cleansed, n),n))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_text(size=8))+
  labs(x='Unique categories', y='Times appears in dataset')

####Convert neighbourhood_group_cleansed to factor

all<- all %>%
  mutate_at(c("neighbourhood_group_cleansed"),as.factor) 

Use vtreat impact encoding host_neighbourhood and neighbourhood_cleansed

library(vtreat)
dTrainN <- all[1:41330,]
# NOT RUN {
# numeric example
set.seed(23525)
# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsN
# and dTrainNTreated
unpack[
  treatmentsN = treatments,
  dTrainNTreated = crossFrame
  ] <- mkCrossFrameNExperiment(
    dframe = dTrainN,
    varlist = setdiff(c("host_neighbourhood","neighbourhood_cleansed"), 'price'),
    outcomename = 'price',
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)
##                  origName                                               varName
## 1      host_neighbourhood                               host_neighbourhood_catP
## 2      host_neighbourhood                               host_neighbourhood_catN
## 3      host_neighbourhood                               host_neighbourhood_catD
## 4  neighbourhood_cleansed                           neighbourhood_cleansed_catP
## 5  neighbourhood_cleansed                           neighbourhood_cleansed_catN
## 6  neighbourhood_cleansed                           neighbourhood_cleansed_catD
## 7      host_neighbourhood                             host_neighbourhood_lev_x_
## 8      host_neighbourhood                      host_neighbourhood_lev_x_Astoria
## 9      host_neighbourhood     host_neighbourhood_lev_x_Bedford_minus_Stuyvesant
## 10     host_neighbourhood                     host_neighbourhood_lev_x_Bushwick
## 11     host_neighbourhood                host_neighbourhood_lev_x_Crown_Heights
## 12     host_neighbourhood                       host_neighbourhood_lev_x_Harlem
## 13     host_neighbourhood          host_neighbourhood_lev_x_Hell_tick_s_Kitchen
## 14     host_neighbourhood              host_neighbourhood_lev_x_Upper_East_Side
## 15     host_neighbourhood              host_neighbourhood_lev_x_Upper_West_Side
## 16     host_neighbourhood                 host_neighbourhood_lev_x_Williamsburg
## 17 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Bedford_minus_Stuyvesant
## 18 neighbourhood_cleansed                 neighbourhood_cleansed_lev_x_Bushwick
## 19 neighbourhood_cleansed                  neighbourhood_cleansed_lev_x_Chelsea
## 20 neighbourhood_cleansed            neighbourhood_cleansed_lev_x_Crown_Heights
## 21 neighbourhood_cleansed              neighbourhood_cleansed_lev_x_East_Harlem
## 22 neighbourhood_cleansed             neighbourhood_cleansed_lev_x_East_Village
## 23 neighbourhood_cleansed               neighbourhood_cleansed_lev_x_Greenpoint
## 24 neighbourhood_cleansed                   neighbourhood_cleansed_lev_x_Harlem
## 25 neighbourhood_cleansed      neighbourhood_cleansed_lev_x_Hell_tick_s_Kitchen
## 26 neighbourhood_cleansed                  neighbourhood_cleansed_lev_x_Midtown
## 27 neighbourhood_cleansed          neighbourhood_cleansed_lev_x_Upper_East_Side
## 28 neighbourhood_cleansed          neighbourhood_cleansed_lev_x_Upper_West_Side
## 29 neighbourhood_cleansed             neighbourhood_cleansed_lev_x_Williamsburg
##    code          rsq           sig extraModelDegrees
## 1  catP 1.559984e-04  1.111069e-02               399
## 2  catN 1.523291e-01  0.000000e+00               399
## 3  catD 7.150960e-02  0.000000e+00               399
## 4  catP 1.262647e-04  2.234741e-02               217
## 5  catN 1.852532e-01  0.000000e+00               217
## 6  catD 1.471896e-01  0.000000e+00               217
## 7   lev 2.764632e-05  2.851102e-01                 0
## 8   lev 2.356971e-03  5.324265e-23                 0
## 9   lev 6.757545e-03  6.723372e-63                 0
## 10  lev 8.964951e-03  6.306274e-83                 0
## 11  lev 1.887332e-03  9.933154e-19                 0
## 12  lev 6.989494e-04  7.637321e-08                 0
## 13  lev 9.768384e-03  3.182998e-90                 0
## 14  lev 1.597463e-03  4.347630e-16                 0
## 15  lev 2.823125e-03  3.115498e-27                 0
## 16  lev 3.020449e-04  4.103105e-04                 0
## 17  lev 8.454522e-03  2.712710e-78                 0
## 18  lev 1.111700e-02 1.755041e-102                 0
## 19  lev 1.138014e-02 7.094750e-105                 0
## 20  lev 2.117851e-03  7.938611e-21                 0
## 21  lev 5.898481e-04  7.891580e-07                 0
## 22  lev 4.712387e-03  2.316619e-44                 0
## 23  lev 8.837375e-11  9.984752e-01                 0
## 24  lev 2.059180e-03  2.712338e-20                 0
## 25  lev 1.372127e-02 3.415772e-126                 0
## 26  lev 2.696492e-02 1.162390e-247                 0
## 27  lev 1.944887e-03  2.973161e-19                 0
## 28  lev 4.128700e-03  4.515167e-39                 0
## 29  lev 5.382014e-04  2.395449e-06                 0
# the treated frame is a "cross frame" which
# is a transform of the training data built 
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
varnames <- treatmentsN$scoreFrame[treatmentsN$scoreFrame$code%in% c('catN'),'varName']
all1 <- prepare(treatmentsN, all, pruneSig=NULL,varRestriction = varnames)
all$host_neighbourhood <- all1$host_neighbourhood_catN
all$neighbourhood_cleansed <- all1$neighbourhood_cleansed_catN
all[,c("host_neighbourhood")] <- sapply(all[,c("host_neighbourhood")],as.numeric)
all[,c("neighbourhood_cleansed")] <- sapply(all[,c("neighbourhood_cleansed")],as.numeric)
which(is.na(all$neighbourhood_cleansed))
## integer(0)

5.2.13 Min/Max nights

cor(train$maximum_nights,train$price)
## [1] 0.0002808924
cor(train$minimum_nights,train$price)
## [1] 0.006085157
cor(train$maximum_nights_avg_ntm,train$price)
## [1] 0.02807403
ggplot(train,aes(minimum_nights_avg_ntm,price)) + 
  geom_point()

6 Feature Engernering (generate new features)

6.1 use first_review and last review to create new feature “review_time”

Hmisc::describe(all$first_review)
## all$first_review 
##        n  missing distinct 
##    51661        2     3253 
## 
## lowest : 2009-02-23 2009-03-12 2009-04-13 2009-04-20 2009-04-23
## highest: 2020-06-04 2020-06-05 2020-06-06 2020-06-07 2020-06-08
Hmisc::describe(all$last_review)
## all$last_review 
##        n  missing distinct 
##    51661        2     2093 
## 
## lowest : 2011-05-12 2011-08-29 2011-09-18 2011-09-19 2011-10-23
## highest: 2020-06-05 2020-06-06 2020-06-07 2020-06-08 2020-06-09
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
all$first_review <- ymd(all[,c("first_review")])
all$last_review <- ymd(all[,c("last_review")])
all<-  all %>%
  mutate(diff_day = as.double(difftime(all$last_review,all$first_review,units = "days"))) 
for (i in 1:nrow(all)){
        if(is.na(all$diff_day[i])){
              all$diff_day[i] <- median(all$diff_day,na.rm = TRUE)
        }
}

6.2 amenities

amenities are very good feature for our model. Intuitively, the amenities may have correlation with price. But the question is, in what aspects?

My guess are the number of amenities, and the type of amenities.

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following object is masked from 'package:wrapr':
## 
##     :=
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
all <- all %>%
  mutate(amenities = gsub("\\[", "",amenities)) %>% ## remove [
  mutate(amenities = gsub("\\]", "",amenities)) %>% ## remove ]
  mutate(amenities = gsub("\\'", "",amenities)) %>% ## remove '
  mutate(amenities = gsub("\\.", "",amenities)) %>% ## remove '
  mutate(amenities = str_replace_all(amenities,"\\s", "")) %>%
  mutate(amenities = noquote(amenities)) %>%
  mutate(amenities = tolower(amenities)) %>%
  mutate(number_amenities = sapply(strsplit(amenities,','), uniqueN))
cor(all$number_amenities,all$price,use="complete.obs")
## [1] 0.1807215
library(qdapTools)
## 
## Attaching package: 'qdapTools'
## The following object is masked from 'package:data.table':
## 
##     shift
## The following object is masked from 'package:dplyr':
## 
##     id
all = cbind(all,mtabulate(strsplit(as.character(all$amenities), ',')))
head(all$amenities)
## [1] tv,wifi,airconditioning,kitchen,heating,washer,dryer,smokedetector,carbonmonoxidedetector,fireextinguisher,essentials,hangers,hairdryer,selfcheck-in,keypad,privateentrance                                                                                                                                                                                                                                                                                                  
## [2] tv,internet,wifi,kitchen,petsliveonthisproperty,dog(s),freestreetparking,buzzer/wirelessintercom,heating,family/kidfriendly,washer,dryer,smokedetector,carbonmonoxidedetector,fireextinguisher,hairdryer,iron,laptopfriendlyworkspace,translationmissing:enhosting_amenity_50,selfcheck-in,lockbox,privatelivingroom,hotwater,microwave,coffeemaker,refrigerator,dishwasher,dishesandsilverware,cookingbasics,oven,stove,patioorbalcony,gardenorbackyard,bakingsheet,trashcan
## [3] tv,wifi,airconditioning,kitchen,heating,washer,dryer,smokedetector,carbonmonoxidedetector,firstaidkit,fireextinguisher,essentials,shampoo,lockonbedroomdoor,hangers,hairdryer,iron,laptopfriendlyworkspace,selfcheck-in,smartlock,privatelivingroom,privateentrance,room-darkeningshades,hotwater,bedlinens,microwave,coffeemaker,refrigerator,dishwasher,dishesandsilverware,cookingbasics,oven,stove,gardenorbackyard                                                      
## [4] wifi,airconditioning,kitchen,buzzer/wirelessintercom,heating,smokedetector,carbonmonoxidedetector,essentials,shampoo,hangers,hairdryer,iron,hotwater,microwave,refrigerator,dishesandsilverware,cookingbasics,oven,stove,hostgreetsyou                                                                                                                                                                                                                                       
## [5] wifi,airconditioning,kitchen,freestreetparking,heating,washer,dryer,smokedetector,carbonmonoxidedetector,firstaidkit,fireextinguisher,essentials,shampoo,lockonbedroomdoor,hangers,hairdryer,iron,laptopfriendlyworkspace,privatelivingroom,privateentrance,hotwater,microwave,coffeemaker,refrigerator,dishesandsilverware,cookingbasics,oven,stove                                                                                                                         
## [6] wifi,airconditioning,kitchen,heating,smokedetector,firstaidkit,essentials,hangers,hairdryer,iron,laptopfriendlyworkspace,privatelivingroom,coffeemaker
all <- all %>%
  mutate_at(c("amenities"),as.character) 
all1 <- all %>%
  mutate(amenities = str_split(amenities, ",")) %>%
  select(amenities)
words.freq<-table(unlist(all1));
words.freq<-sort(words.freq, decreasing=TRUE)
#names(words.freq)[71:153]
all = all[, !(colnames(all) %in% names(words.freq)[71:153])]
dim(all)
## [1] 51663   134
head(all)
##       id host_since host_response_time host_response_rate host_acceptance_rate
## 1 914965 2019-06-13 within a few hours                70%                  N/A
## 2 262763 2011-06-08     within an hour                86%                  25%
## 3 279053 2013-01-20       within a day               100%                  66%
## 4 385319 2014-08-18     within an hour               100%                  98%
## 5 476558 2018-09-14     within an hour               100%                 100%
## 6 522365 2019-05-31       within a day                83%                  90%
##   host_is_superhost host_neighbourhood host_listings_count
## 1                 0         -49.759367                   0
## 2                 0           1.451335                   1
## 3                 1           3.577991                   1
## 4                 0          72.433628                   3
## 5                 1         -55.268907                   6
## 6                 1         -35.597427                   1
##   host_total_listings_count
## 1                         0
## 2                         1
## 3                         1
## 4                         3
## 5                         6
## 6                         1
##                                                                          host_verifications
## 1                                                                                 ['phone']
## 2         ['email', 'phone', 'facebook', 'reviews', 'jumio', 'government_id', 'work_email']
## 3                                   ['email', 'phone', 'reviews', 'jumio', 'government_id']
## 4                                   ['email', 'phone', 'reviews', 'jumio', 'government_id']
## 5                                                                        ['email', 'phone']
## 6 ['phone', 'jumio', 'offline_government_id', 'selfie', 'government_id', 'identity_manual']
##   host_identity_verified    neighbourhood neighbourhood_cleansed
## 1                      0         Brooklyn             -51.220420
## 2                      1         Brooklyn               3.873703
## 3                      0 Prospect Heights               3.873703
## 4                      1        Manhattan              71.085886
## 5                      0        Sunnyside             -58.480477
## 6                      0           Queens             -38.990225
##   neighbourhood_group_cleansed zipcode is_location_exact property_type
## 1                     Brooklyn   11207                 1   Condominium
## 2                     Brooklyn   11238                 0     Apartment
## 3                     Brooklyn   11238                 1   Guest suite
## 4                    Manhattan   10018                 1     Apartment
## 5                       Queens   11104                 1         House
## 6                       Queens   11103                 1     Apartment
##         room_type accommodates bathrooms bedrooms beds bed_type
## 1 Entire home/apt            5       1.5        2    2 Real Bed
## 2 Entire home/apt            4       1.0        2    1 Real Bed
## 3 Entire home/apt            3       1.0        1    1 Real Bed
## 4    Private room            2       1.0        1    1 Real Bed
## 5    Private room            2       1.5        1    1 Real Bed
## 6    Private room            2       1.0        1    1 Real Bed
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                       amenities
## 1                                                                                                                                                                                                                                                                                                   tv,wifi,airconditioning,kitchen,heating,washer,dryer,smokedetector,carbonmonoxidedetector,fireextinguisher,essentials,hangers,hairdryer,selfcheck-in,keypad,privateentrance
## 2 tv,internet,wifi,kitchen,petsliveonthisproperty,dog(s),freestreetparking,buzzer/wirelessintercom,heating,family/kidfriendly,washer,dryer,smokedetector,carbonmonoxidedetector,fireextinguisher,hairdryer,iron,laptopfriendlyworkspace,translationmissing:enhosting_amenity_50,selfcheck-in,lockbox,privatelivingroom,hotwater,microwave,coffeemaker,refrigerator,dishwasher,dishesandsilverware,cookingbasics,oven,stove,patioorbalcony,gardenorbackyard,bakingsheet,trashcan
## 3                                                       tv,wifi,airconditioning,kitchen,heating,washer,dryer,smokedetector,carbonmonoxidedetector,firstaidkit,fireextinguisher,essentials,shampoo,lockonbedroomdoor,hangers,hairdryer,iron,laptopfriendlyworkspace,selfcheck-in,smartlock,privatelivingroom,privateentrance,room-darkeningshades,hotwater,bedlinens,microwave,coffeemaker,refrigerator,dishwasher,dishesandsilverware,cookingbasics,oven,stove,gardenorbackyard
## 4                                                                                                                                                                                                                                        wifi,airconditioning,kitchen,buzzer/wirelessintercom,heating,smokedetector,carbonmonoxidedetector,essentials,shampoo,hangers,hairdryer,iron,hotwater,microwave,refrigerator,dishesandsilverware,cookingbasics,oven,stove,hostgreetsyou
## 5                                                                                                                          wifi,airconditioning,kitchen,freestreetparking,heating,washer,dryer,smokedetector,carbonmonoxidedetector,firstaidkit,fireextinguisher,essentials,shampoo,lockonbedroomdoor,hangers,hairdryer,iron,laptopfriendlyworkspace,privatelivingroom,privateentrance,hotwater,microwave,coffeemaker,refrigerator,dishesandsilverware,cookingbasics,oven,stove
## 6                                                                                                                                                                                                                                                                                                                        wifi,airconditioning,kitchen,heating,smokedetector,firstaidkit,essentials,hangers,hairdryer,iron,laptopfriendlyworkspace,privatelivingroom,coffeemaker
##   price cleaning_fee guests_included extra_people minimum_nights maximum_nights
## 1   145          110               1            0              6             29
## 2   199           50               2           50              2           1125
## 3   165           75               2           50              2             12
## 4    59           49               1           25             30           1125
## 5    48           25               1           15              2           1125
## 6    80           45               1            0              2             30
##   minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights
## 1                      6                      6                     29
## 2                      2                      2                   1125
## 3                      1                      3                     12
## 4                     30                     30                   1125
## 5                      2                      2                   1125
## 6                      2                      2                     30
##   maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm
## 1                     29                      6                     29
## 2                   1125                      2                   1125
## 3                     12                      2                     12
## 4                   1125                     30                   1125
## 5                   1125                      2                   1125
## 6                     30                      2                     30
##   calendar_updated availability_30 availability_60 availability_90
## 1      2 weeks ago               0               0               0
## 2      3 weeks ago               8              38              68
## 3        yesterday              17              28              48
## 4      4 weeks ago              11              11              11
## 5     2 months ago               0               0              12
## 6      4 weeks ago               5              35              65
##   availability_365 number_of_reviews number_of_reviews_ltm first_review
## 1               18                 2                     2   2019-07-15
## 2              158                 5                     4   2017-04-11
## 3              112                52                    45   2019-01-21
## 4               35                30                    29   2018-05-19
## 5               73                14                    14   2019-05-07
## 6               65                20                    20   2019-07-01
##   last_review review_scores_rating review_scores_accuracy
## 1  2019-07-20                 10.0                     10
## 2  2019-12-31                  9.2                     10
## 3  2020-03-01                 10.0                     10
## 4  2020-01-01                  9.4                      9
## 5  2020-01-01                  9.9                     10
## 6  2020-01-01                 10.0                     10
##   review_scores_cleanliness review_scores_checkin review_scores_communication
## 1                        10                    10                          10
## 2                        10                    10                          10
## 3                        10                    10                          10
## 4                         9                     9                          10
## 5                         9                    10                          10
## 6                        10                    10                          10
##   review_scores_location review_scores_value instant_bookable
## 1                     10                  10                0
## 2                     10                   8                0
## 3                     10                  10                0
## 4                     10                   9                0
## 5                     10                  10                1
## 6                     10                  10                1
##           cancellation_policy require_guest_profile_picture
## 1 strict_14_with_grace_period                             f
## 2 strict_14_with_grace_period                             f
## 3 strict_14_with_grace_period                             f
## 4 strict_14_with_grace_period                             f
## 5 strict_14_with_grace_period                             f
## 6                    flexible                             f
##   require_guest_phone_verification calculated_host_listings_count
## 1                                f                              1
## 2                                f                              1
## 3                                f                              1
## 4                                f                              2
## 5                                f                              3
## 6                                f                              1
##   calculated_host_listings_count_entire_homes
## 1                                           1
## 2                                           1
## 3                                           1
## 4                                           0
## 5                                           0
## 6                                           0
##   calculated_host_listings_count_private_rooms
## 1                                            0
## 2                                            0
## 3                                            0
## 4                                            2
## 5                                            3
## 6                                            1
##   calculated_host_listings_count_shared_rooms reviews_per_month  zipcode_m
## 1                                           0              2.00 -59.044505
## 2                                           0              0.14  -2.906385
## 3                                           0              3.72  -1.465574
## 4                                           0              1.35  91.877609
## 5                                           0              1.34 -54.118285
## 6                                           0              2.33 -40.282820
##   diff_day number_amenities 24-hourcheck-in airconditioning bathtub bbqgrill
## 1        5               16               0               1       0        0
## 2      994               35               0               0       0        0
## 3      405               34               0               1       0        0
## 4      592               20               0               1       0        0
## 5      239               28               0               1       0        0
## 6      184               13               0               1       0        0
##   bedlinens breakfast buzzer/wirelessintercom cabletv carbonmonoxidedetector
## 1         0         0                       0       0                      1
## 2         0         0                       1       0                      1
## 3         1         0                       0       0                      1
## 4         0         0                       1       0                      1
## 5         0         0                       0       0                      1
## 6         0         0                       0       0                      0
##   children’sbooksandtoys cleaningbeforecheckout coffeemaker cookingbasics
## 1                      0                      0           0             0
## 2                      0                      0           1             1
## 3                      0                      0           1             1
## 4                      0                      0           0             1
## 5                      0                      0           1             1
## 6                      0                      0           1             0
##   dishesandsilverware dishwasher doorman dryer elevator essentials
## 1                   0          0       0     1        0          1
## 2                   1          1       0     1        0          0
## 3                   1          1       0     1        0          1
## 4                   1          0       0     0        0          1
## 5                   1          0       0     1        0          1
## 6                   0          0       0     0        0          1
##   ethernetconnection extrapillowsandblankets family/kidfriendly
## 1                  0                       0                  0
## 2                  0                       0                  1
## 3                  0                       0                  0
## 4                  0                       0                  0
## 5                  0                       0                  0
## 6                  0                       0                  0
##   fireextinguisher firstaidkit freeparkingonpremises freestreetparking
## 1                1           0                     0                 0
## 2                1           0                     0                 1
## 3                1           1                     0                 0
## 4                0           0                     0                 0
## 5                1           1                     0                 1
## 6                0           1                     0                 0
##   gardenorbackyard gym hairdryer hangers heating hostgreetsyou hotwater
## 1                0   0         1       1       1             0        0
## 2                1   0         1       0       1             0        1
## 3                1   0         1       1       1             0        1
## 4                0   0         1       1       1             1        1
## 5                0   0         1       1       1             0        1
## 6                0   0         1       1       1             0        0
##   indoorfireplace internet iron keypad kitchen laptopfriendlyworkspace lockbox
## 1               0        0    0      1       1                       0       0
## 2               0        1    1      0       1                       1       1
## 3               0        0    1      0       1                       1       0
## 4               0        0    1      0       1                       0       0
## 5               0        0    1      0       1                       1       0
## 6               0        0    1      0       1                       1       0
##   lockonbedroomdoor longtermstaysallowed luggagedropoffallowed microwave
## 1                 0                    0                     0         0
## 2                 0                    0                     0         1
## 3                 1                    0                     0         1
## 4                 0                    0                     0         1
## 5                 1                    0                     0         1
## 6                 0                    0                     0         0
##   nostairsorstepstoenter other oven pack’nplay/travelcrib
## 1                      0     0    0                     0
## 2                      0     0    1                     0
## 3                      0     0    1                     0
## 4                      0     0    1                     0
## 5                      0     0    1                     0
## 6                      0     0    0                     0
##   paidparkingoffpremises patioorbalcony petsallowed petsliveonthisproperty
## 1                      0              0           0                      0
## 2                      0              1           0                      1
## 3                      0              0           0                      0
## 4                      0              0           0                      0
## 5                      0              0           0                      0
## 6                      0              0           0                      0
##   privateentrance privatelivingroom refrigerator room-darkeningshades
## 1               1                 0            0                    0
## 2               0                 1            1                    0
## 3               1                 1            1                    1
## 4               0                 0            1                    0
## 5               1                 1            1                    0
## 6               0                 1            0                    0
##   safetycard selfcheck-in shampoo showergel singlelevelhome smokedetector
## 1          0            1       0         0               0             1
## 2          0            1       0         0               0             1
## 3          0            1       1         0               0             1
## 4          0            0       1         0               0             1
## 5          0            0       1         0               0             1
## 6          0            0       0         0               0             1
##   smokingallowed stove translationmissing:enhosting_amenity_49
## 1              0     0                                       0
## 2              0     1                                       0
## 3              0     1                                       0
## 4              0     1                                       0
## 5              0     1                                       0
## 6              0     0                                       0
##   translationmissing:enhosting_amenity_50 trashcan tv washer wifi
## 1                                       0        0  1      1    1
## 2                                       1        1  1      1    1
## 3                                       0        0  1      1    1
## 4                                       0        0  0      0    1
## 5                                       0        0  0      1    1
## 6                                       0        0  0      0    1

6.3 Bedroom

6.4 Total rooms

all<- all %>%
  mutate(total_room = bathrooms+bedrooms)
head(all$total_room)
## [1] 3.5 3.0 2.0 2.0 2.5 2.0

6.5 create longitude & latitude

Read zip_code_database.csv, and join two tables to get longitude and latitude.

zipcode_info = read.csv("/Users/chensixian/Desktop/CU AA/5200/Kaggle/rentlala2021/zip_code_database.csv")
head(zipcode_info)
##     zip     type decommissioned  primary_city acceptable_cities
## 1 82001 STANDARD              0      Cheyenne     Fe Warren AFB
## 2 82002   UNIQUE              0      Cheyenne                  
## 3 82003   PO BOX              0      Cheyenne                  
## 4 82005 STANDARD              0 Fe Warren AFB          Cheyenne
## 5 82006   UNIQUE              0      Cheyenne                  
## 6 82007 STANDARD              0      Cheyenne                  
##      unacceptable_cities state         county       timezone area_codes
## 1                           WY Laramie County America/Denver        307
## 2       State Of Wyoming    WY Laramie County America/Denver        307
## 3                           WY Laramie County America/Denver        307
## 4                           WY Laramie County America/Denver        307
## 5 Wy State Game And Fish    WY Laramie County America/Denver        307
## 6                           WY Laramie County America/Denver        307
##   world_region country latitude longitude irs_estimated_population
## 1         <NA>      US    41.14   -104.79                    32090
## 2         <NA>      US    41.14   -104.79                       25
## 3         <NA>      US    41.14   -104.79                     3208
## 4         <NA>      US    41.14   -104.87                      594
## 5         <NA>      US    41.14   -104.79                        0
## 6         <NA>      US    41.08   -104.71                    17660

Transform numerical zip code into string

zipcode_info$zip<- sapply(zipcode_info$zip,as.character)
head(zipcode_info)
##     zip     type decommissioned  primary_city acceptable_cities
## 1 82001 STANDARD              0      Cheyenne     Fe Warren AFB
## 2 82002   UNIQUE              0      Cheyenne                  
## 3 82003   PO BOX              0      Cheyenne                  
## 4 82005 STANDARD              0 Fe Warren AFB          Cheyenne
## 5 82006   UNIQUE              0      Cheyenne                  
## 6 82007 STANDARD              0      Cheyenne                  
##      unacceptable_cities state         county       timezone area_codes
## 1                           WY Laramie County America/Denver        307
## 2       State Of Wyoming    WY Laramie County America/Denver        307
## 3                           WY Laramie County America/Denver        307
## 4                           WY Laramie County America/Denver        307
## 5 Wy State Game And Fish    WY Laramie County America/Denver        307
## 6                           WY Laramie County America/Denver        307
##   world_region country latitude longitude irs_estimated_population
## 1         <NA>      US    41.14   -104.79                    32090
## 2         <NA>      US    41.14   -104.79                       25
## 3         <NA>      US    41.14   -104.79                     3208
## 4         <NA>      US    41.14   -104.87                      594
## 5         <NA>      US    41.14   -104.79                        0
## 6         <NA>      US    41.08   -104.71                    17660

Use left join.

all<- left_join(x = all, y = zipcode_info[ , c("zip", "latitude","longitude")], by = c("zipcode"="zip"))
all$longitude <- abs(all$longitude)
ggplot(all,aes(longitude)) +
  geom_histogram(bins=100)

ggplot(all[1:41330,],aes(longitude,latitude,color = price)) +
  geom_point() +
  ylim(40.5, 41) +
  xlim(73, 74.5)
## Warning: Removed 1 rows containing missing values (geom_point).

6.6 Final feature

coln <- c("host_since","host_response_rate","host_acceptance_rate","neighbourhood",
          "host_verifications","neighbourhood","zipcode","amenities","minimum_minimum_nights","maximum_nights",
          "maximum_minimum_nights","minimum_maximum_nights","maximum_maximum_nights","maximum_nights_avg_ntm","maximum_nights_avg_ntm","calendar_updated","first_review","last_review","require_guest_profile_picture",
          "require_guest_phone_verification")
all1 <- all[,!names(all) %in% coln]

Split all data into train1 and test1, they have the same row numbers as the original dataset.

train1 <- all1[1:41330,]
test1 <- all1[is.na(all1$price),]
which(is.na(train1))
## integer(0)

7 Build Model

7.1 Random Forest

7.2 Xgboost

One-hot encoding all factors.

library(vtreat)
trt = designTreatmentsZ(dframe = train1,
                        varlist = names(all1))
## [1] "vtreat 1.6.3 inspecting inputs Mon Nov 29 11:05:24 2021"
## [1] "designing treatments Mon Nov 29 11:05:24 2021"
## [1] " have initial level statistics Mon Nov 29 11:05:24 2021"
## [1] " scoring treatments Mon Nov 29 11:05:26 2021"
## [1] "have treatment plan Mon Nov 29 11:05:26 2021"
newvars = trt$scoreFrame[trt$scoreFrame$code%in% c('clean','lev'),'varName']

train1 = prepare(treatmentplan = trt, 
                      dframe = train1,
                      varRestriction = newvars)
test1 = prepare(treatmentplan = trt, 
                     dframe = test1,
                     varRestriction = newvars)
head(train1)
##       id host_is_superhost host_neighbourhood host_listings_count
## 1 914965                 0         -49.759367                   0
## 2 262763                 0           1.451335                   1
## 3 279053                 1           3.577991                   1
## 4 385319                 0          72.433628                   3
## 5 476558                 1         -55.268907                   6
## 6 522365                 1         -35.597427                   1
##   host_total_listings_count host_identity_verified neighbourhood_cleansed
## 1                         0                      0             -51.220420
## 2                         1                      1               3.873703
## 3                         1                      0               3.873703
## 4                         3                      1              71.085886
## 5                         6                      0             -58.480477
## 6                         1                      0             -38.990225
##   is_location_exact accommodates bathrooms bedrooms beds price cleaning_fee
## 1                 1            5       1.5        2    2   145          110
## 2                 0            4       1.0        2    1   199           50
## 3                 1            3       1.0        1    1   165           75
## 4                 1            2       1.0        1    1    59           49
## 5                 1            2       1.5        1    1    48           25
## 6                 1            2       1.0        1    1    80           45
##   guests_included extra_people minimum_nights minimum_nights_avg_ntm
## 1               1            0              6                      6
## 2               2           50              2                      2
## 3               2           50              2                      2
## 4               1           25             30                     30
## 5               1           15              2                      2
## 6               1            0              2                      2
##   availability_30 availability_60 availability_90 availability_365
## 1               0               0               0               18
## 2               8              38              68              158
## 3              17              28              48              112
## 4              11              11              11               35
## 5               0               0              12               73
## 6               5              35              65               65
##   number_of_reviews number_of_reviews_ltm review_scores_rating
## 1                 2                     2                 10.0
## 2                 5                     4                  9.2
## 3                52                    45                 10.0
## 4                30                    29                  9.4
## 5                14                    14                  9.9
## 6                20                    20                 10.0
##   review_scores_accuracy review_scores_cleanliness review_scores_checkin
## 1                     10                        10                    10
## 2                     10                        10                    10
## 3                     10                        10                    10
## 4                      9                         9                     9
## 5                     10                         9                    10
## 6                     10                        10                    10
##   review_scores_communication review_scores_location review_scores_value
## 1                          10                     10                  10
## 2                          10                     10                   8
## 3                          10                     10                  10
## 4                          10                     10                   9
## 5                          10                     10                  10
## 6                          10                     10                  10
##   instant_bookable calculated_host_listings_count
## 1                0                              1
## 2                0                              1
## 3                0                              1
## 4                0                              2
## 5                1                              3
## 6                1                              1
##   calculated_host_listings_count_entire_homes
## 1                                           1
## 2                                           1
## 3                                           1
## 4                                           0
## 5                                           0
## 6                                           0
##   calculated_host_listings_count_private_rooms
## 1                                            0
## 2                                            0
## 3                                            0
## 4                                            2
## 5                                            3
## 6                                            1
##   calculated_host_listings_count_shared_rooms reviews_per_month  zipcode_m
## 1                                           0              2.00 -59.044505
## 2                                           0              0.14  -2.906385
## 3                                           0              3.72  -1.465574
## 4                                           0              1.35  91.877609
## 5                                           0              1.34 -54.118285
## 6                                           0              2.33 -40.282820
##   diff_day number_amenities X24_minus_hourcheck_minus_in airconditioning
## 1        5               16                            0               1
## 2      994               35                            0               0
## 3      405               34                            0               1
## 4      592               20                            0               1
## 5      239               28                            0               1
## 6      184               13                            0               1
##   bathtub bbqgrill bedlinens breakfast buzzer_slash_wirelessintercom cabletv
## 1       0        0         0         0                             0       0
## 2       0        0         0         0                             1       0
## 3       0        0         1         0                             0       0
## 4       0        0         0         0                             1       0
## 5       0        0         0         0                             0       0
## 6       0        0         0         0                             0       0
##   carbonmonoxidedetector children_sbooksandtoys cleaningbeforecheckout
## 1                      1                      0                      0
## 2                      1                      0                      0
## 3                      1                      0                      0
## 4                      1                      0                      0
## 5                      1                      0                      0
## 6                      0                      0                      0
##   coffeemaker cookingbasics dishesandsilverware dishwasher doorman dryer
## 1           0             0                   0          0       0     1
## 2           1             1                   1          1       0     1
## 3           1             1                   1          1       0     1
## 4           0             1                   1          0       0     0
## 5           1             1                   1          0       0     1
## 6           1             0                   0          0       0     0
##   elevator essentials ethernetconnection extrapillowsandblankets
## 1        0          1                  0                       0
## 2        0          0                  0                       0
## 3        0          1                  0                       0
## 4        0          1                  0                       0
## 5        0          1                  0                       0
## 6        0          1                  0                       0
##   family_slash_kidfriendly fireextinguisher firstaidkit freeparkingonpremises
## 1                        0                1           0                     0
## 2                        1                1           0                     0
## 3                        0                1           1                     0
## 4                        0                0           0                     0
## 5                        0                1           1                     0
## 6                        0                0           1                     0
##   freestreetparking gardenorbackyard gym hairdryer hangers heating
## 1                 0                0   0         1       1       1
## 2                 1                1   0         1       0       1
## 3                 0                1   0         1       1       1
## 4                 0                0   0         1       1       1
## 5                 1                0   0         1       1       1
## 6                 0                0   0         1       1       1
##   hostgreetsyou hotwater indoorfireplace internet iron keypad kitchen
## 1             0        0               0        0    0      1       1
## 2             0        1               0        1    1      0       1
## 3             0        1               0        0    1      0       1
## 4             1        1               0        0    1      0       1
## 5             0        1               0        0    1      0       1
## 6             0        0               0        0    1      0       1
##   laptopfriendlyworkspace lockbox lockonbedroomdoor longtermstaysallowed
## 1                       0       0                 0                    0
## 2                       1       1                 0                    0
## 3                       1       0                 1                    0
## 4                       0       0                 0                    0
## 5                       1       0                 1                    0
## 6                       1       0                 0                    0
##   luggagedropoffallowed microwave nostairsorstepstoenter other oven
## 1                     0         0                      0     0    0
## 2                     0         1                      0     0    1
## 3                     0         1                      0     0    1
## 4                     0         1                      0     0    1
## 5                     0         1                      0     0    1
## 6                     0         0                      0     0    0
##   pack_nplay_slash_travelcrib paidparkingoffpremises patioorbalcony petsallowed
## 1                           0                      0              0           0
## 2                           0                      0              1           0
## 3                           0                      0              0           0
## 4                           0                      0              0           0
## 5                           0                      0              0           0
## 6                           0                      0              0           0
##   petsliveonthisproperty privateentrance privatelivingroom refrigerator
## 1                      0               1                 0            0
## 2                      1               0                 1            1
## 3                      0               1                 1            1
## 4                      0               0                 0            1
## 5                      0               1                 1            1
## 6                      0               0                 1            0
##   room_minus_darkeningshades safetycard selfcheck_minus_in shampoo showergel
## 1                          0          0                  1       0         0
## 2                          0          0                  1       0         0
## 3                          1          0                  1       1         0
## 4                          0          0                  0       1         0
## 5                          0          0                  0       1         0
## 6                          0          0                  0       0         0
##   singlelevelhome smokedetector smokingallowed stove
## 1               0             1              0     0
## 2               0             1              0     1
## 3               0             1              0     1
## 4               0             1              0     1
## 5               0             1              0     1
## 6               0             1              0     0
##   translationmissing_colon_enhosting_amenity_49
## 1                                             0
## 2                                             0
## 3                                             0
## 4                                             0
## 5                                             0
## 6                                             0
##   translationmissing_colon_enhosting_amenity_50 trashcan tv washer wifi
## 1                                             0        0  1      1    1
## 2                                             1        1  1      1    1
## 3                                             0        0  1      1    1
## 4                                             0        0  0      0    1
## 5                                             0        0  0      1    1
## 6                                             0        0  0      0    1
##   total_room latitude longitude host_response_time_lev_x_a_few_days_or_more
## 1        3.5    40.67     73.89                                           0
## 2        3.0    40.68     73.96                                           0
## 3        2.0    40.68     73.96                                           0
## 4        2.0    40.76     73.99                                           0
## 5        2.5    40.74     73.92                                           0
## 6        2.0    40.76     73.91                                           0
##   host_response_time_lev_x_unknown host_response_time_lev_x_within_a_day
## 1                                0                                     0
## 2                                0                                     0
## 3                                0                                     1
## 4                                0                                     0
## 5                                0                                     0
## 6                                0                                     1
##   host_response_time_lev_x_within_a_few_hours
## 1                                           1
## 2                                           0
## 3                                           0
## 4                                           0
## 5                                           0
## 6                                           0
##   host_response_time_lev_x_within_an_hour
## 1                                       0
## 2                                       1
## 3                                       0
## 4                                       1
## 5                                       1
## 6                                       0
##   neighbourhood_group_cleansed_lev_x_Bronx
## 1                                        0
## 2                                        0
## 3                                        0
## 4                                        0
## 5                                        0
## 6                                        0
##   neighbourhood_group_cleansed_lev_x_Brooklyn
## 1                                           1
## 2                                           1
## 3                                           1
## 4                                           0
## 5                                           0
## 6                                           0
##   neighbourhood_group_cleansed_lev_x_Manhattan
## 1                                            0
## 2                                            0
## 3                                            0
## 4                                            1
## 5                                            0
## 6                                            0
##   neighbourhood_group_cleansed_lev_x_Queens
## 1                                         0
## 2                                         0
## 3                                         0
## 4                                         0
## 5                                         1
## 6                                         1
##   neighbourhood_group_cleansed_lev_x_Staten_Island
## 1                                                0
## 2                                                0
## 3                                                0
## 4                                                0
## 5                                                0
## 6                                                0
##   property_type_lev_x_Apartment property_type_lev_x_Boutique_hotel
## 1                             0                                  0
## 2                             1                                  0
## 3                             0                                  0
## 4                             1                                  0
## 5                             0                                  0
## 6                             1                                  0
##   property_type_lev_x_Condominium property_type_lev_x_Guest_suite
## 1                               1                               0
## 2                               0                               0
## 3                               0                               1
## 4                               0                               0
## 5                               0                               0
## 6                               0                               0
##   property_type_lev_x_Hotel property_type_lev_x_House property_type_lev_x_Loft
## 1                         0                         0                        0
## 2                         0                         0                        0
## 3                         0                         0                        0
## 4                         0                         0                        0
## 5                         0                         1                        0
## 6                         0                         0                        0
##   property_type_lev_x_Other property_type_lev_x_Townhouse
## 1                         0                             0
## 2                         0                             0
## 3                         0                             0
## 4                         0                             0
## 5                         0                             0
## 6                         0                             0
##   room_type_lev_x_Entire_home_slash_apt room_type_lev_x_Hotel_room
## 1                                     1                          0
## 2                                     1                          0
## 3                                     1                          0
## 4                                     0                          0
## 5                                     0                          0
## 6                                     0                          0
##   room_type_lev_x_Private_room room_type_lev_x_Shared_room
## 1                            0                           0
## 2                            0                           0
## 3                            0                           0
## 4                            1                           0
## 5                            1                           0
## 6                            1                           0
##   bed_type_lev_x_Airbed bed_type_lev_x_Couch bed_type_lev_x_Futon
## 1                     0                    0                    0
## 2                     0                    0                    0
## 3                     0                    0                    0
## 4                     0                    0                    0
## 5                     0                    0                    0
## 6                     0                    0                    0
##   bed_type_lev_x_Pull_minus_out_Sofa bed_type_lev_x_Real_Bed
## 1                                  0                       1
## 2                                  0                       1
## 3                                  0                       1
## 4                                  0                       1
## 5                                  0                       1
## 6                                  0                       1
##   cancellation_policy_lev_x_flexible cancellation_policy_lev_x_moderate
## 1                                  0                                  0
## 2                                  0                                  0
## 3                                  0                                  0
## 4                                  0                                  0
## 5                                  0                                  0
## 6                                  1                                  0
##   cancellation_policy_lev_x_strict
## 1                                0
## 2                                0
## 3                                0
## 4                                0
## 5                                0
## 6                                0
##   cancellation_policy_lev_x_strict_14_with_grace_period
## 1                                                     1
## 2                                                     1
## 3                                                     1
## 4                                                     1
## 5                                                     1
## 6                                                     0
##   cancellation_policy_lev_x_super_strict_30
## 1                                         0
## 2                                         0
## 3                                         0
## 4                                         0
## 5                                         0
## 6                                         0
##   cancellation_policy_lev_x_super_strict_60
## 1                                         0
## 2                                         0
## 3                                         0
## 4                                         0
## 5                                         0
## 6                                         0
label_train <- train1$price
dtrain <- xgb.DMatrix(data = as.matrix(train1[, !names(train1) %in% c("price")] ), label= label_train)

After parameter tunning, I find max_depth =7 gave me the lowest local RMSE. And I also set gamma as 2 to prevent overfitting.

default_param<-list(
        objective = "reg:linear",
        booster = "gbtree",
        eta=0.05, #default = 0.3
        gamma=2,
        max_depth=7, #default=6
        min_child_weight=4, #default=1
        subsample=1,
        colsample_bytree=1
)

I set nround to 1000, so that model can find the best iteration.

xgbcv <- xgb.cv( params = default_param, data = dtrain, nrounds = 1000, nfold = 5, showsd = T, stratified = T, print_every_n = 40, early_stopping_rounds = 10, maximize = F)
## [11:05:28] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [11:05:28] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [11:05:29] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [11:05:29] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [11:05:30] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:168.488867+0.721354  test-rmse:168.573300+2.922263 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [41] train-rmse:58.966072+0.246791   test-rmse:66.362495+1.830743 
## [81] train-rmse:48.706135+0.145505   test-rmse:60.748808+1.615544 
## [121]    train-rmse:45.214314+0.119480   test-rmse:59.761356+1.525036 
## [161]    train-rmse:43.053410+0.183099   test-rmse:59.305290+1.483833 
## [201]    train-rmse:41.374628+0.140586   test-rmse:59.032665+1.453495 
## [241]    train-rmse:39.953196+0.125722   test-rmse:58.873120+1.449741 
## [281]    train-rmse:38.659590+0.170006   test-rmse:58.739704+1.467543 
## [321]    train-rmse:37.375632+0.234299   test-rmse:58.642889+1.479525 
## [361]    train-rmse:36.270510+0.227390   test-rmse:58.581929+1.486024 
## [401]    train-rmse:35.284778+0.265717   test-rmse:58.521047+1.467086 
## [441]    train-rmse:34.259458+0.307754   test-rmse:58.471801+1.451686 
## Stopping. Best iteration:
## [443]    train-rmse:34.213709+0.314700   test-rmse:58.469186+1.449435

[424] train-rmse:35.121716+0.256536 test-rmse:58.969470+1.625035 [687] train-rmse:35.260908+0.088415 test-rmse:58.728155+1.636479 [771] train-rmse:34.294465+0.081300 test-rmse:58.859532+1.554717 [467] train-rmse:33.863055+0.325591 test-rmse:58.503620+1.433310

xgb_mod <- xgb.train(data = dtrain, params=default_param, nrounds = 467)
## [11:16:14] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
#XGBpred <- predict(xgb_mod, dtest)
#head(XGBpred)
dtrain <- xgb.DMatrix(data = as.matrix(train1[, !names(train1) %in% c("price")] ))
XGBpred_t <- predict(xgb_mod, dtrain)
head(XGBpred_t)
## [1] 166.94334 207.57115 157.97151  86.39487  57.63774  77.51135
dtest <- xgb.DMatrix(data = as.matrix(test1[, !names(test1) %in% c("price")] ))
XGBpred <- predict(xgb_mod, dtest)
head(XGBpred)
## [1] 143.81493  52.97718  90.75443 197.08588 244.56882 129.34155

Although some would argue Xgboost model has low interpretability, the plot shows some of the most important features in the Xgboost model.

library(Ckmeans.1d.dp) #required for ggplot clustering
mat <- xgb.importance (feature_names = colnames(dtest),model = xgb_mod)
xgb.ggplot.importance(importance_matrix = mat[1:30], rel_to_first = TRUE)

library(tidyverse) 
my_submission <- data_frame('id' = ScoringData$id, 'price' = XGBpred)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
my_submission
## # A tibble: 10,333 x 2
##       id price
##    <int> <dbl>
##  1 10310 144. 
##  2 10426  53.0
##  3 10583  90.8
##  4 10973 197. 
##  5 11000 245. 
##  6 11161 129. 
##  7 11262  97.6
##  8 11561  97.2
##  9 11589 280. 
## 10 11625 270. 
## # … with 10,323 more rows
# save our file
write_csv(my_submission, 'submission_xgboost_v11_logtransform_-999_bin_1.csv')